scrapy (middlewares, fetching dynamically loaded data)

 

First, comment out the parts that are not needed:

Comment out the entire spider middleware class (PppSpiderMiddleware), since the spider middleware is not used here.

 

In the downloader middleware, the class method that returns a middleware instance (from_crawler in the generated template) is not needed, so comment it out.

 

The log output (the spider_opened logging method) is not needed either; comment it out.

 

-----------------------------------------------------------------------------------------------------------------------------------

Only the downloader middleware is covered here, not the spider middleware (the difference: requests reach the downloader middleware only after passing the scheduler's filtering, whereas the spider middleware sits before that step).

The three hook methods of the downloader middleware: process_request, process_response and process_exception.
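A minimal sketch of the three hooks, based on the template Scrapy generates (the class name here is just a placeholder):

class DemoDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it is handed to the downloader.
        # Return None to continue, a Response to skip the download,
        # a Request to reschedule it, or raise IgnoreRequest to drop it.
        return None

    def process_response(self, request, response, spider):
        # Called for every response coming back from the downloader.
        # Must return a Response (possibly a new one) or a Request.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download or process_request() raises an exception.
        # Return None to keep handling it, or a Response / Request to recover.
        pass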

-----------------------------------------------------------------------------------------------------------------------------------

Enable DOWNLOADER_MIDDLEWARES in settings.py.
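For example, uncomment or add the entry in settings.py (the project and class names below assume the uuu project used in the code further down; adjust the path to your own project):

DOWNLOADER_MIDDLEWARES = {
   'uuu.middlewares.UuuDownloaderMiddleware': 543,
}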

 

 

************************************************ UA and proxy pools *********************************************************

As taught by teacher 青青草原灰太狼:
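A minimal sketch of the usual UA-pool approach (rotate the User-Agent in process_request; the class name and the UA strings below are placeholders, not from the original post):

import random

class RandomUaDownloaderMiddleware:
    # Placeholder User-Agent pool; fill in your own strings
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    ]

    def process_request(self, request, spider):
        # Pick a random UA for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None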

 

 

The proxy setup taught by teacher bobo:

After a request exception is detected, a proxy is attached and the request is sent again. (If the request sent through the proxy still fails, this becomes an endless retry loop.)
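A minimal sketch of that idea (the class name and proxy addresses are placeholders):

import random

PROXY_http = ['http://ip:port']      # placeholder proxies
PROXY_https = ['https://ip:port']    # placeholder proxies

class ProxyOnErrorDownloaderMiddleware:
    def process_exception(self, request, exception, spider):
        # Attach a proxy that matches the URL scheme, then resend the request
        if request.url.startswith('https'):
            request.meta['proxy'] = random.choice(PROXY_https)
        else:
            request.meta['proxy'] = random.choice(PROXY_http)
        # Returning the request puts it back into the schedule; if it keeps
        # failing even with the proxy, this retries forever (see note above)
        return request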

 

 

*********************************************************************************************************

********************************************* Dynamic pages ************************************************************
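The idea (full middlewares.py and spider code below): the spider creates a headless Selenium browser in __init__; the downloader middleware's process_response loads request.url with that browser, waits for the dynamic content, and wraps driver.page_source in a new HtmlResponse, so the spider ends up parsing the fully rendered page.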

If the method at the bottom of the spider (close) does not shut the browser down properly, it can be written as:

def closed(self, spider):
    self.driver.quit()

 

 

 

 

 

 

from scrapy import signals
from scrapy.http import HtmlResponse
import time
class UuuDownloaderMiddleware:
    def process_response(self, request, response, spider):
        print('!!!!!! intercepting the response !!!!!!')
        # Reuse the Selenium driver created on the spider
        driver = spider.driver
        driver.get(url=request.url)
        time.sleep(3)  # give the page time to render its dynamic content
        # page_source holds the rendered HTML, including the dynamically loaded data
        page_text = driver.page_source
        # Wrap the rendered HTML in a new HtmlResponse and return it to the engine
        new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        return new_response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
The above is middlewares.py; the spider file follows:
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class UaSpider(scrapy.Spider):
    name = 'ua'
    # allowed_domains = ['www.ua.com']
    start_urls = ['http://scxk.nmpa.gov.cn:81/xk/']
    driver = None

    def __init__(self):
        # Start one headless Chrome; the downloader middleware reuses it via spider.driver
        # (assumes chromedriver.exe sits in the working directory)
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        self.driver = webdriver.Chrome('chromedriver.exe', chrome_options=chrome_options)

    def parse(self, response):
        # response is the HtmlResponse built in the middleware,
        # so the dynamically loaded list items are visible to XPath
        li_list = response.xpath('//*[@id="gzlist"]/li')
        for li in li_list:
            res = li.xpath('./dl/@title').get()
            print(res)

    def close(self):
        # If this does not shut the browser down, use
        # def closed(self, spider): self.driver.quit()  (see the note above)
        self.driver.close()
The above is the spider file.

 

*********************************************************************************************************

posted @ 2021-03-23 02:02  丑矬穷屌