Using Selenium in Scrapy + AI-based article classification and keyword extraction

Introduction

  • When scraping certain websites with the Scrapy framework, you will often run into pages whose data is loaded dynamically. If you send a request to such a URL directly with Scrapy, you will never obtain the dynamically loaded part of the data. Yet if you open the same URL in a browser, the dynamically loaded data does appear. So, if we also want Scrapy to obtain that dynamically loaded data, we must use Selenium to create a browser object and send the request through that browser, which renders the page and gives us the dynamic data.

Details

1. Case analysis:

    - Requirement: scrape the news data under the Domestic, International, Military, and Aviation sections of NetEase News (news.163.com)

    - Analysis: when you click the Domestic link and enter the corresponding page, you will find that the news items shown there are loaded dynamically. If the program requests that URL directly, it cannot obtain the dynamically loaded news data. We therefore need Selenium to instantiate a browser object and request the URL through it, so that the dynamically loaded news data can be retrieved.

After the engine hands the request for a section URL to the downloader, the downloader fetches the page, wraps the downloaded page data in a response object, and returns it to the engine, which then forwards the response to the Spider. The page data stored in the response object the Spider receives does not contain the dynamically loaded news data. To obtain it, we need to intercept, in a downloader middleware, the response object that the downloader hands back to the engine, replace the page data it carries with a version that does contain the dynamically loaded news data, and then pass the tampered response object on to the Spider for parsing.

2. How to use Selenium within Scrapy:

  • Override the spider's constructor and instantiate a Selenium browser object in it (the browser object only needs to be instantiated once); a minimal skeleton of this and the next step is shown right after this list
  • Override the spider's closed(self, spider) method and close the browser object inside it; this method is called when the spider finishes
  • Override the downloader middleware's process_response method so that it intercepts the response object and replaces the page data stored in it
  • Enable the downloader middleware in the settings file
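
A minimal, hedged skeleton of the first two steps (spider name and chromedriver path are placeholders; note that the full example below instantiates the browser as a class attribute instead of inside __init__, which achieves the same thing):

import scrapy
from selenium import webdriver

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://news.163.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the browser is instantiated exactly once, when the spider is created
        self.bro = webdriver.Chrome(executable_path='path/to/chromedriver')

    def parse(self, response):
        pass

    def closed(self, spider):
        # called once when the spider finishes; release the browser here
        self.bro.quit()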

Code implementation:

In the spider file (wangyipro.py):

# -*- coding: utf-8 -*-
import scrapy
from ..items import WangyiproItem
from selenium import webdriver
from selenium.webdriver import ChromeOptions
class WangyiproSpider(scrapy.Spider):
    # hide the "Chrome is being controlled by automated test software" notice
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # the browser object is instantiated once, as a class attribute
    bro = webdriver.Chrome(executable_path=r'E:\学习相关\python全栈17期课程及笔记\爬虫相关\Scrapy\WangyiPro\chromedriver.exe',
                           options=option)
    name = 'wangyipro'
    # allowed_domains = ['www.i.com']
    start_urls = ['https://news.163.com/']
    head_url_lst = []
    # parse the body text of one article detail page
    def content_parse(self,response):
        item = response.meta["item"]
        content_list = response.xpath('//div[@id="endText"]//text()').extract()
        item['content'] = ''.join(content_list)
        yield item

    # parse one section page (rendered by Selenium via the middleware) and follow each article link
    def parse_detail(self,response):
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div in div_list:
            item = WangyiproItem()
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            item["title"]= title
            yield scrapy.Request(url=detail_url, callback=self.content_parse, meta={'item': item})

    def parse(self, response):
        li_list = response.xpath('//div[@class="ns_area list"]/ul/li')
        # indexes of the Domestic, International, Military and Aviation <li> entries
        li_index = [3, 4, 6, 7]
        for index in li_index:
            li = li_list[index]
            head_url = li.xpath("./a/@href").extract_first()
            self.head_url_lst.append(head_url)
            yield scrapy.Request(url=head_url, callback=self.parse_detail)

    # close the Selenium browser when the spider finishes
    def closed(self,spider):
        self.bro.quit()

In items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()

In middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
import time

class WangyiproDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        # only the section-page URLs need their responses replaced
        head_url_lst = spider.head_url_lst
        bro = spider.bro
        if request.url in head_url_lst:
            bro.get(request.url)
            time.sleep(2)
            js = "window.scrollTo(0,document.body.scrollHeight)"
            # scroll down once so lazily loaded news items get rendered
            bro.execute_script(js)
            time.sleep(2)
            page_text = bro.page_source
            # return a new response object to replace the original one; its body carries the Selenium-rendered page for this request
            return HtmlResponse(url=bro.current_url,body=page_text,encoding="utf-8",request=request)
        return response
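
The fixed time.sleep(2) calls above are the simplest way to wait for the page to render. A hedged alternative (my own sketch, not part of the original project) is to keep scrolling until the page height stops growing, which picks up every lazily loaded item on very long section pages:

import time

def scroll_to_bottom(bro, pause=1.0):
    # keep scrolling until document.body.scrollHeight stops increasing
    last_height = bro.execute_script("return document.body.scrollHeight")
    while True:
        bro.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        new_height = bro.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return
        last_height = new_height

Calling scroll_to_bottom(bro) in process_response instead of the single scroll would trade speed for completeness.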

In pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from aip import AipNlp
class WangyiproPipeline(object):
    conn = None
    cursor = None
    """ 你的 APPID AK SK """
    APP_ID = '16204893'
    API_KEY = 'nRc0OOUxDB35NA5L9ElBGsYM'
    SECRET_KEY = 'FDPslNtmzEGeB5fI0RfKVthd0gGDlCmG'
    client = AipNlp(APP_ID, API_KEY, SECRET_KEY)
    # use the Baidu AI NLP client to extract keywords for the article
    def keyword(self,title, content):
        keyword = self.client.keyword(title, content)
        tag_list = []
        for k in keyword["items"]:
            tag_list.append(k["tag"])
        return tag_list

    # use the Baidu AI NLP client to classify the article and return the highest-scoring topic tag
    def topic(self,title, content):
        topic = self.client.topic(title, content)
        score_list = []
        for k in topic["item"]:
            for i in topic["item"][k]:
                score_list.append(i["score"])
        score_list.sort()
        highest_score = score_list[-1]
        for k in topic["item"]:
            for i in topic["item"][k]:
                if i["score"] == highest_score:
                    tag = i["tag"]
                    return tag

    def open_spider(self,spider):
        self.conn = pymysql.Connect(host="127.0.0.1",port=3306,user="root",password ="",db ="wangyi")
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        title = item['title']
        content = item['content']
        # join the keyword list into a single string before storing it
        word = ','.join(self.keyword(title, content))
        top = self.topic(title, content)
        try:
            # parameterized SQL, so quotes inside the article text do not break the statement
            self.cursor.execute('insert into wangyi values (%s,%s,%s,%s)', (title, content, word, top))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
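
For reference, the keyword() and topic() helpers above expect response shapes roughly like the following. The values are illustrative and the key names are inferred from how the code iterates over the results; consult the Baidu AipNlp documentation for the authoritative format:

# what client.keyword(title, content) is expected to return (illustrative only)
keyword_result = {
    "items": [
        {"tag": "航空", "score": 0.87},
        {"tag": "军事", "score": 0.65},
    ]
}

# what client.topic(title, content) is expected to return (illustrative only)
topic_result = {
    "item": {
        "lv1_tag_list": [{"tag": "军事", "score": 0.92}],
        "lv2_tag_list": [{"tag": "军事/航空", "score": 0.81}],
    }
}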

In settings.py (uncomment and configure the following):

DOWNLOADER_MIDDLEWARES = {
   'WangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}



ITEM_PIPELINES = {
   'WangyiPro.pipelines.WangyiproPipeline': 300,
}
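
Besides the two blocks above, a run of this project typically needs a few more settings. The values below are my own suggestions rather than part of the original post:

# settings.py (suggested additions, adjust as needed)
ROBOTSTXT_OBEY = False     # otherwise robots.txt may block the requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # send a normal browser UA
LOG_LEVEL = 'ERROR'        # keep the console output readable

The spider can then be started with scrapy crawl wangyipro.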

Database storage and table structure (the original post shows a screenshot of the MySQL table here, omitted):
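
The pipeline's insert statement writes four text columns (title, content, keywords, topic), so the wangyi table can be created with something like the sketch below. Column names and lengths are my assumption, not taken from the original screenshot:

import pymysql

# one-off script to create the target table; adjust credentials as needed
conn = pymysql.Connect(host="127.0.0.1", port=3306, user="root", password="", db="wangyi")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS wangyi (
        title   VARCHAR(255),
        content TEXT,
        keyword VARCHAR(255),
        topic   VARCHAR(64)
    )
""")
conn.commit()
conn.close()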

 
