scrapy (middlewares, fetching dynamically loaded data)
First, comment out what is not used:
Comment out the entire spider-middleware class (PppSpiderMiddleware) — spider middleware is not used here.

In the downloader middleware, the class method that returns an instance of the middleware (the template's from_crawler) is not needed; comment it out.

The logging output (spider_opened) is not needed either; comment it out.

-----------------------------------------------------------------------------------------------------------------------------------
Only the downloader middleware is studied here, not the spider middleware. (Difference: requests passing through the spider middleware have not yet been through the scheduler's duplicate filter; requests reaching the downloader middleware have.)
The three methods of a downloader middleware:
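A skeleton of those three hooks, following the shape of the Scrapy project template (the class name `DemoDownloaderMiddleware` is a placeholder; the comments summarize the documented return-value contracts):

```python
class DemoDownloaderMiddleware:
    """Skeleton showing the three hooks of a Scrapy downloader middleware."""

    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # Return None to continue the chain, a Response to skip the
        # download entirely, or a Request to reschedule.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the spider.
        # Must return a Response (possibly a new one) or a Request.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download (or a process_request) raises.
        # Return None to pass the exception on, or a Response/Request
        # to recover from it.
        return None
```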

-----------------------------------------------------------------------------------------------------------------------------------
Enable DOWNLOADER_MIDDLEWARES in settings.py (it is commented out in the default template).
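For example, a settings.py fragment (the project name `ppp` and the class name are assumptions from these notes — adjust the dotted path to your own project; lower priority numbers run earlier on requests):

```python
# settings.py fragment: map the middleware's dotted path to its priority
DOWNLOADER_MIDDLEWARES = {
    'ppp.middlewares.UuuDownloaderMiddleware': 543,
}
```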

************************************************UA / proxy pool*********************************************************
UA pool, as taught by teacher 青青草原灰太狼:
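A minimal sketch of a random User-Agent middleware (the UA strings below are abbreviated placeholders — substitute real browser User-Agent values):

```python
import random

# Hypothetical UA pool; fill in real, full browser User-Agent strings
USER_AGENT_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the outgoing request's User-Agent header
        request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)
        return None  # let the request continue down the middleware chain
```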


Setting a proxy, as taught by teacher bobo:
When a request exception is detected, the request is re-sent with a proxy set. (If the request still fails after the proxy is set, this retries in an endless loop.)
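A sketch of that idea in process_exception, with an added retry cap (`MAX_PROXY_RETRIES` is my assumption, not in the original notes) to avoid the endless loop just mentioned; the proxy addresses are TEST-NET placeholders:

```python
import random

# Placeholder proxy addresses (TEST-NET range) -- substitute real proxies
PROXY_POOL = ['http://203.0.113.10:8080', 'http://203.0.113.11:3128']

class ProxyOnErrorMiddleware:
    """Sketch: after a download exception, attach a proxy and retry,
    capped so a dead proxy cannot cause an infinite retry loop."""

    MAX_PROXY_RETRIES = 3  # assumed cap, not from the original notes

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('proxy_retries', 0)
        if retries >= self.MAX_PROXY_RETRIES:
            return None  # give up; let normal exception handling proceed
        # request.meta['proxy'] tells Scrapy's HTTP handler which proxy to use
        request.meta['proxy'] = random.choice(PROXY_POOL)
        request.meta['proxy_retries'] = retries + 1
        return request  # returning the request re-schedules it for download
```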

*********************************************************************************************************
*********************************************Dynamic pages************************************************************
If the close method at the bottom does not shut the browser down properly, it can be written as Scrapy's closed shortcut instead (called once when the spider closes; the argument it receives is the close reason):
def closed(self, reason):
    self.driver.quit()



from scrapy.http import HtmlResponse
import time


class UuuDownloaderMiddleware:
    def process_response(self, request, response, spider):
        print('!!!!!! intercepting !!!!!!')
        # Re-fetch the page with the spider's Selenium driver so that
        # JavaScript-rendered content is included
        driver = spider.driver
        driver.get(url=request.url)
        time.sleep(3)  # crude wait for the dynamic data to finish loading
        # page_source holds the rendered HTML, including the dynamic data
        page_text = driver.page_source
        # Wrap the rendered HTML in a new HtmlResponse for the spider
        new_response = HtmlResponse(url=request.url, body=page_text,
                                    encoding='utf-8', request=request)
        return new_response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class UaSpider(scrapy.Spider):
    name = 'ua'
    # allowed_domains = ['www.ua.com']
    start_urls = ['http://scxk.nmpa.gov.cn:81/xk/']
    driver = None

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # run Chrome without a window
        # Selenium 3 style; Selenium 4 prefers webdriver.Chrome(options=...)
        self.driver = webdriver.Chrome('chromedriver.exe',
                                       chrome_options=chrome_options)

    def parse(self, response):
        li_list = response.xpath('//*[@id="gzlist"]/li')
        for li in li_list:
            res = li.xpath('./dl/@title').get()
            print(res)

    def close(self):
        self.driver.close()
*********************************************************************************************************
