Python Crawler #020: Scrapy Middleware

Scrapy provides two kinds of middleware: downloader middleware and spider middleware. This post covers downloader middleware (Downloader Middleware).

1. Downloader Middleware

  • Downloader middleware is the bridge between the engine and the downloader. When the engine sends a request to the downloader, the middleware can modify it in transit, for example by attaching a random request header or routing it through an IP proxy pool, which are common countermeasures against anti-crawling checks.

  • To use a middleware, register it in settings.py, for example to enable a custom UserAgentDownloadMiddleware. The smaller the number, the earlier the middleware's process_request is called.

    DOWNLOADER_MIDDLEWARES = {
       'useragent.middlewares.UseragentDownloaderMiddleware': 543,
       'useragent.middlewares.UserAgentDownloadMiddleware': 100,
    }
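Conversely, mapping a middleware's class path to None disables it. If the custom middleware is meant to fully own the User-Agent header, Scrapy's built-in UserAgentMiddleware can be switched off (a sketch; adjust the project name to yours):

```python
# settings.py: a value of None disables a middleware entirely
DOWNLOADER_MIDDLEWARES = {
    'useragent.middlewares.UserAgentDownloadMiddleware': 100,
    # turn off Scrapy's built-in User-Agent handling so it cannot
    # overwrite the header set by the custom middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```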
    


2. Downloader Middleware Configuration


2.1 Setting a Random Request Header

  • Test URL: http://httpbin.org/user-agent

  • Create a Scrapy project (e.g. scrapy startproject useragent):

  • Edit settings.py (note: comment out the default middleware):

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for useragent project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'useragent'
    
    SPIDER_MODULES = ['useragent.spiders']
    NEWSPIDER_MODULE = 'useragent.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'useragent (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400"
    }
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'useragent.middlewares.UseragentSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        # comment out the default middleware and enable the custom one
       # 'useragent.middlewares.UseragentDownloaderMiddleware': 543,
       'useragent.middlewares.UserAgentDownloadMiddleware': 543,
    }
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'useragent.pipelines.UseragentPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
  • Write items.py (the template needs no changes for this demo):

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class UseragentItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass
    
  • Write httpbin.py:

    # -*- coding: utf-8 -*-
    import scrapy
    
    class HttpbinSpider(scrapy.Spider):
        name = 'httpbin'
        allowed_domains = ['httpbin.org']
        # change the start URL
        start_urls = ['http://httpbin.org/user-agent']
    
        def parse(self, response):
            print(response.text)
            # Scrapy deduplicates requests: the same URL is not requested again by default
            # dont_filter=True disables this deduplication
            yield scrapy.Request(self.start_urls[0], dont_filter=True)
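The deduplication that dont_filter=True bypasses can be pictured as a set of already-seen requests. Scrapy actually fingerprints the whole request (method, URL, body), but a plain URL set is a reasonable simplified sketch:

```python
# simplified sketch of Scrapy's request deduplication
# (real Scrapy fingerprints the whole request, not just the URL)
seen = set()

def should_download(url, dont_filter=False):
    """Return True if the request would be passed on to the downloader."""
    if dont_filter:
        return True       # dont_filter=True skips the dedup check entirely
    if url in seen:
        return False      # duplicate request: silently dropped
    seen.add(url)
    return True

first  = should_download('http://httpbin.org/user-agent')                    # True
second = should_download('http://httpbin.org/user-agent')                    # False
forced = should_download('http://httpbin.org/user-agent', dont_filter=True)  # True
```

Without dont_filter=True, the spider above would stop after a single request, because re-yielding the same URL would be filtered out.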
    
  • Write middlewares.py:

    ################ Custom middleware ################
    import random
    class UserAgentDownloadMiddleware(object):
        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
        ]
        def process_request(self, request, spider):
            # pick a User-Agent at random
            user_agent = random.choice(self.USER_AGENTS)
            # replace the request's User-Agent header with the random one
            request.headers['User-Agent'] = user_agent
            # remember: the middleware must be enabled in settings.py
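The middleware logic can be exercised outside Scrapy with a minimal stand-in request object (FakeRequest below is a hypothetical test double, not Scrapy API):

```python
import random

# the same rotation logic as the middleware above, in isolation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
]

class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: it just carries headers."""
    def __init__(self):
        self.headers = {}

def process_request(request, spider):
    # pick a User-Agent at random and overwrite the header
    request.headers['User-Agent'] = random.choice(USER_AGENTS)

req = FakeRequest()
process_request(req, spider=None)
```

Each request gets one of the listed User-Agents, so repeated requests are spread across the pool.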
    
  • Write pipelines.py (unchanged from the template):

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class UseragentPipeline(object):
        def process_item(self, item, spider):
            return item
    

2.2 Setting a Random IP Proxy

  • Test URL: http://httpbin.org/ip

  • The procedure is largely the same as above; remember to enable the custom middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
       # 'proxy.middlewares.ProxyDownloaderMiddleware': 543,
       'proxy.middlewares.IPProxyDownloadMiddleware': 543,
    }
    
  • Write middlewares.py:

    ######################## Custom middleware ###############################
    import random

    class IPProxyDownloadMiddleware(object):
        # the proxies must be live; a dead proxy will hang and eventually raise an error
        PROXYS = [
            "http://39.137.95.73:8080",
            "http://222.95.144.231:3000",
            "http://121.237.149.156:3000",
            "http://121.237.148.251:3000",
        ]

        def process_request(self, request, spider):
            proxy = random.choice(self.PROXYS)
            request.meta['proxy'] = proxy
        # remember: activate this custom middleware in settings.py
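If a proxy requires authentication, Scrapy's built-in HttpProxyMiddleware reads credentials embedded in the proxy URL (http://user:pass@host:port); equivalently, you can set the Proxy-Authorization header yourself. A sketch with placeholder credentials:

```python
import base64

def basic_proxy_auth(user, password):
    """Build an HTTP Basic 'Proxy-Authorization' header value."""
    creds = base64.b64encode(f'{user}:{password}'.encode()).decode()
    return 'Basic ' + creds

# inside process_request, for an authenticated proxy
# (host, user and password below are placeholders):
#   request.meta['proxy'] = 'http://1.2.3.4:8080'
#   request.headers['Proxy-Authorization'] = basic_proxy_auth('user', 'pass')
header = basic_proxy_auth('user', 'pass')  # → 'Basic dXNlcjpwYXNz'
```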
    

2.3 Using Selenium

  • Using Selenium in a downloader middleware intercepts the requests Scrapy issues and sends them through ChromeDriver instead.

    The rendered page source is wrapped in a response object and returned to the spider for parsing, which makes it possible to scrape dynamically loaded pages.

  • The remaining steps are largely the same as above; set the target URL in the spider file and enable the custom middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
     # comment out the default middleware
     # 'jianshu_spider.middlewares.JianshuSpiderDownloaderMiddleware': 543,
      'jianshu_spider.middlewares.SeleniumDownloadMiddleware': 543,
    }
    
  • 编写 middlewares.py

    ########################### Custom middleware ############################
    from selenium import webdriver
    import time
    from scrapy.http.response.html import HtmlResponse
    
    class SeleniumDownloadMiddleware(object):
        def __init__(self):
            self.driver = webdriver.Chrome()
    
        # intercept the request issued by Scrapy, fetch it with ChromeDriver, and capture the result
        def process_request(self, request, spider):
            self.driver.get(request.url)
            time.sleep(3)  # crude wait for dynamically loaded content
            # page source after JavaScript has executed
            source = self.driver.page_source
            # wrap the page source in an HtmlResponse and return it to the spider
            ## current_url is the URL actually loaded (it may differ from request.url after redirects)
            response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
            return response
    
    
posted @ 2023-06-28 23:02  枫_Null