Using Scrapy, Part 6: Crawling Dynamic Pages

  • Introduction to AJAX, with a page walkthrough
  • Reading content from JS files
  • Constructing target URLs
  • A Tencent Video comment spider

1. Introduction to AJAX

AJAX: Asynchronous JavaScript and XML, a web development technique for building interactive web applications.

By exchanging small amounts of data with the server in the background, AJAX lets a page update asynchronously. This means a part of the page can be updated without reloading the whole page.

Characteristics of AJAX pages:

  • Pages load quickly
  • Information updates without refreshing the page
  • The page source differs from the rendered content
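
A quick way to see that last point in practice. This is a minimal sketch, assuming the article id 1164818577 (taken from the varticle URL later in this post) also works with the older coral comment API; the field names match the spider below.

import requests

# The raw page source does not contain the comment text,
# because comments are loaded later by JS.
page = requests.get("http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html")
print(u"\u7b49\u4e86" in page.text)          # expected: False

# The comment API returns the same text as JSON.
api = requests.get("http://coral.qq.com/article/1164818577/comment"
                   "?commentid=0&reqnum=20")
comments = api.json()['data']['commentid']   # field names as used in the spider below
print(comments[0]['content'])                # the first comment's text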

2. Reading content from JS files

  • Inspect element and list the JS requests
  • Find the suspicious file
  • Parse the content of the JS file (see the sketch after this list)
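
The last step usually means stripping a JSONP wrapper. A minimal sketch of that parse, assuming the response looks like the _varticle...commentv2(...) payload shown in section 4:

import json
import re

def parse_jsonp(text):
    """Strip a JSONP wrapper like callbackName({...}) and return the dict."""
    # Capture everything between the first '(' and the last ')'.
    payload = re.search(r'^\s*\w+\((.*)\)\s*;?\s*$', text, re.S).group(1)
    return json.loads(payload)

# Example with the kind of response shown in section 4:
data = parse_jsonp('_varticle1164818577commentv2({"errCode":0,"data":{"oritotal":34}})')
print(data["data"]["oritotal"])  # 34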

3. Constructing target URLs:

  • Build them from a pattern
  • Read them from a file
  • Write them by hand (see the sketch after this list)
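
A minimal sketch of all three options; the URL template and id come from this post, while ids.txt is a hypothetical file name:

# 1. Build URLs from a pattern: substitute ids into a template.
template = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"
urls = [template % article_id for article_id in ["1164818577"]]

# 2. Read ids from a file, one per line ('ids.txt' is hypothetical).
with open("ids.txt") as f:
    urls += [template % line.strip() for line in f if line.strip()]

# 3. List URLs by hand.
urls.append("http://coral.qq.com/article/1164818577/comment"
            "?commentid=0&reqnum=20")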

4. Finding the comments

The video page:

http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html

The comment API template (the %s slot takes a comment id):

http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20

The API that maps a vid to its comment id (the vid is appended after callback=):

http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback=

Create the project and the spider:

scrapy startproject qqvideo
cd qqvideo
scrapy genspider videocomment "v.qq.com"


Write items.py:

import scrapy


class QqvideoItem(scrapy.Item):
    name = scrapy.Field()     # commenter's nickname
    content = scrapy.Field()  # comment text
    ctime = scrapy.Field()    # comment time (the timeDifference field)


Write the spider file:

Because the video comments are loaded by JS, they cannot be found directly in the page source.

Therefore:

1. Open: http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html

2. Inspect element, open the Network tab, refresh the comments, then check the JS requests one by one, looking at each response for the comment content.

3. One of those JS requests contained the comment content: https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6014751539826595774&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=9&_=1538742461708

Open that address (right-click, open link in new tab):

Its response is shown below; the content fields hold the comment text:

_varticle1164818577commentv2({"errCode":0,"data":
  {"targetid":1164818577,"oritotal":34,"orireqnum":10,"oriretnum":10,
   "first":"6014058098125775528","last":"6014042988833472860","hasnext":true,
   "oriCommList":[
     {"targetid":"1164818577","parent":"0","time":"1433863186","userid":"194169133",
      "content":"\u7b49\u4e86","up":"26","pokenum":"0","type":1,"repnum":1,
      "checkhotscale":"0","checktype":"1","checkstatus":"1","isdeleted":0,
      "custom":"","puserid":"0","orireplynum":"1","voteid":0,"guessid":0,
      "richtype":"0","rootid":"0","thirdid":"","id":"6014058098125775528",
      "indexscore":12366}, ...


To fetch the comments, only part of that URL is needed:

https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&reqnum=10


4. You can see that paging is keyed on the actual commentid; so to turn pages we also need to find the commentid and splice it into the URL.

5. Where does the commentid come from? From the current page of comments: take the commentid of the last entry (a paging sketch follows the spider below).

# -*- coding: utf-8 -*-
import scrapy
import re
import json
from qqvideo.items import QqvideoItem


class VideocommentSpider(scrapy.Spider):
    name = "videocomment"
    start_urls = ['http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html']
    comment_url = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"
    sns_url = "http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback="

    def parse(self, response):
        # Pull the video id out of the page source.
        vid = re.search('vid:"(.*?)"', response.text, re.S).group(1)
        # Append the vid to look up the comment id for this video.
        sns_url = self.sns_url + vid
        yield scrapy.Request(sns_url, callback=self.parse_id)

    def parse_id(self, response):
        # The sns response maps the vid to a comment_id.
        id = re.search('"comment_id":"(.*?)"', response.text, re.S).group(1)
        commentURL = self.comment_url % id
        yield scrapy.Request(commentURL, callback=self.parse_comment)

    def parse_comment(self, response):
        jsDict = json.loads(response.text)
        jsData = jsDict['data']
        comments = jsData['commentid']  # the list of comments on this page
        for each in comments:
            item = QqvideoItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item
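
The parse_comment above only takes the first page. A paging sketch for steps 4 and 5 follows; it assumes the article API carries the same 'last' and 'hasnext' fields as the varticle response shown earlier, and that parse_id passes the article id along via meta={'article_id': id} in its request:

    def parse_comment(self, response):
        jsDict = json.loads(response.text)
        jsData = jsDict['data']
        for each in jsData['commentid']:
            item = QqvideoItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item
        # Take the last commentid on this page as the cursor for the next page.
        if jsData.get('hasnext') and jsData.get('last'):
            next_url = ("http://coral.qq.com/article/%s/comment"
                        "?commentid=%s&reqnum=20"
                        % (response.meta['article_id'], jsData['last']))
            yield scrapy.Request(next_url, callback=self.parse_comment,
                                 meta={'article_id': response.meta['article_id']})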


Write pipelines.py:

Before that, add some configuration in settings.py.

Enable the pipeline:

    ITEM_PIPELINES = {'qqvideo.pipelines.QqvideoPipeline': 300}


Configure MongoDB:

    MONGODB_HOST = '127.0.0.1'
    MONGODB_PORT = 27017
    MONGODB_DBNAME = 'spiders'
    MONGODB_DOCNAME = "qqvideocomment"
Then the pipeline itself:

import pymongo
from scrapy.utils.project import get_project_settings


class QqvideoPipeline(object):
    def __init__(self):
        # Read the MongoDB settings defined in settings.py.
        settings = get_project_settings()
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbName = settings['MONGODB_DBNAME']
        table = settings['MONGODB_DOCNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[dbName]
        self.table = db[table]

    def process_item(self, item, spider):
        # Store each comment item as one MongoDB document.
        commentInfo = dict(item)
        self.table.insert_one(commentInfo)
        return item
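
After running scrapy crawl videocomment, a quick check that items landed in MongoDB; a minimal sketch using the database and collection names from the settings above:

import pymongo

# Connect with the same settings as the pipeline and print a few stored comments.
client = pymongo.MongoClient(host='127.0.0.1', port=27017)
for doc in client['spiders']['qqvideocomment'].find().limit(5):
    print(doc.get('name'), doc.get('content'))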

