Scrapy custom middleware: retrying through a proxy on specific response status codes

0. References

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy

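1. Main idea

When a crawler sends requests too frequently, the site often redirects it temporarily to a login page (302) or refuses access outright (403). For such responses the request can be retried once through a proxy:

(1) Following Scrapy's built-in redirect.py module, pass the response through unchanged when dont_redirect, handle_httpstatus_list or a similar condition applies.

(2) If condition (1) does not apply and the response status code is 302 or 403, re-issue the request through the proxy.

(3) If the status code is still 302 or 403 after going through the proxy, drop the request.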
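
The three rules above can be sketched as a small pure function, independent of Scrapy. This is only an illustration: the name `decide` and its return values are hypothetical, and `meta` stands in for `request.meta`. The `auto_proxy` flag marks a request that has already been retried through the proxy.

```python
PROXY_STATUS = (302, 403)  # same defaults as in this article


def decide(status, meta, spider_statuses=(), proxy_status=PROXY_STATUS):
    """Return 'pass', 'retry_with_proxy' or 'drop' for one response."""
    # Rule (1): the caller asked to handle this response itself.
    if (meta.get('dont_redirect', False)
            or status in spider_statuses
            or status in meta.get('handle_httpstatus_list', [])
            or meta.get('handle_httpstatus_all', False)):
        return 'pass'
    # Statuses outside proxy_status are passed through untouched.
    if status not in proxy_status:
        return 'pass'
    # Rule (2): first failure, retry once through the proxy.
    if not meta.get('auto_proxy'):
        return 'retry_with_proxy'
    # Rule (3): the proxied retry failed as well, give up.
    return 'drop'


print(decide(200, {}))
print(decide(302, {}))
print(decide(403, {'auto_proxy': True}))
print(decide(302, {'dont_redirect': True}))
```

Keeping the decision separate from the Scrapy plumbing makes it easy to check every branch without running a crawl.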

2. Code

Save the following as /site-packages/my_middlewares.py

from w3lib.url import safe_url_string
from urllib.parse import urljoin  # Python 3 standard library

from scrapy.exceptions import IgnoreRequest


class MyAutoProxyDownloaderMiddleware(object):
    """Retry a request once through a proxy when the response status is in PROXY_STATUS."""

    def __init__(self, settings):
        # Status codes that trigger a retry through the proxy
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # Proxy URL in the form accepted by HttpProxyMiddleware, see
        # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(settings=crawler.settings)


    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''
                
            # First failure: retry once through the proxy
            if not request.meta.get('auto_proxy'):
                # Copy meta so the original request object is left untouched
                new_meta = dict(request.meta, auto_proxy=True, proxy=self.proxy_config)
                # dont_filter=True keeps the duplicate filter from discarding the retry
                new_request = request.replace(
                    meta=new_meta, dont_filter=True, priority=request.priority + 2)

                spider.logger.debug('Will AutoProxy for <{} {}> {}'.format(
                    response.status, request.url, redirected_url))
                return new_request

            # Second failure (the request already went through the proxy): drop it
            else:
                spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                    response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
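
The middleware resolves the Location header against the request URL with urljoin, because the header may hold a relative path. A minimal standalone sketch of that step follows; the helper name `resolve_location` and the example.com URL are made up for illustration, while the httpbin URLs are the ones that appear in the results section.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


# A relative Location is resolved against the request URL.
relative = resolve_location('http://httpbin.org/status/302', '/redirect/1')
# An absolute Location replaces the request URL entirely.
absolute = resolve_location('http://httpbin.org/status/302', 'https://example.com/login')

print(relative)  # http://httpbin.org/redirect/1
print(absolute)  # https://example.com/login
```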

3. Usage

(1) Add the following to the project's settings.py. Note that the middleware's order value must lie between those of the default RedirectMiddleware and HttpProxyMiddleware.

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,    
}
PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
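
Why the order value matters: according to the Scrapy documentation, process_request() runs in increasing order of these numbers and process_response() in decreasing order. The sketch below only simulates that ordering with plain Python (the helper names are hypothetical, not part of Scrapy). It suggests that with 601 the custom middleware sees a 302 response before RedirectMiddleware (600) gets a chance to follow it.

```python
# Order values taken from the settings above (600 and 750 are Scrapy's defaults).
ORDERS = {
    'RedirectMiddleware': 600,
    'MyAutoProxyDownloaderMiddleware': 601,
    'HttpProxyMiddleware': 750,
}


def request_order(orders):
    """Middlewares in the order their process_request() runs (ascending)."""
    return sorted(orders, key=orders.get)


def response_order(orders):
    """Middlewares in the order their process_response() runs (descending)."""
    return sorted(orders, key=orders.get, reverse=True)


print(request_order(ORDERS))
print(response_order(ORDERS))
```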

4. Results

2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy

Proxy server log:

squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT

posted @ 2018-07-18 18:47 my8100
