Using Selenium in Scrapy


When a Scrapy crawl needs data that a page loads dynamically, Selenium can be used inside a downloader middleware.
Coding steps:

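Import webdriver from selenium in the spider file
Instantiate the browser in the constructor of the spider class
Close the browser in the spider class's closed method, which Scrapy calls when the spider finishes
Write the browser automation in the process_response method of the downloader middleware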


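from scrapy.http import HtmlResponse


class WangyiproDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Let every request continue to the downloader unchanged.
        return None

    # Intercepts the response object that the downloader passes to the spider.
    def process_response(self, request, response, spider):
        '''
        :param request: the request that produced the intercepted response
        :param response: the intercepted response object
        :param spider: the running instance of the spider class from the spider file
        :return: a replacement HtmlResponse for the listed URLs, otherwise the original response
        '''
        # print(request.url + ' reached the downloader middleware')
        # Replace the page data stored in the response object.
        url_list = [
            'http://news.163.com/world/',
            'http://news.163.com/domestic/',
            'http://news.163.com/air/',
            'http://war.163.com/'
        ]
        if request.url in url_list:
            # Load the same URL in the Selenium browser created by the spider.
            spider.bro.get(url=request.url)
            # page_text is the rendered page source, including the dynamically loaded news data.
            page_text = spider.bro.page_source
            # Return the replacement response object.
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8')
        else:
            return response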

Posted @ 2019-01-17 21:24 by Wualin
