Scrapy Usage, Part 6: Crawling Dynamic Web Pages
- Introduction to AJAX and what AJAX pages look like
- Reading content from JS files
- Constructing target URLs
- Tencent Video comment spider
1. Introduction to AJAX
AJAX: Asynchronous JavaScript and XML, a web development technique for building interactive web applications.
By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously. This means parts of a page can be updated without reloading the whole page.
Characteristics of AJAX pages:
- fast page loads
- information updates without a page refresh
- the page source differs from the rendered content
2. Reading Content from JS Files
- list the JS files with the browser's element inspector
- look for candidate (suspicious) files
- parse the content of the JS file
3. Constructing Target URLs
- build them from a pattern
- read them from a file
- generate them by hand
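The first approach above, building URLs from a pattern, can be sketched in a few lines. This uses the comment-URL template that appears later in the post; the article ids here are hypothetical placeholders, not real ones:

```python
# Build target URLs by filling a known pattern.
comment_url = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"

# Hypothetical article ids; in practice they could be read from a file
# or generated by hand (the other two approaches above).
article_ids = ["1164818577", "1164818578"]
urls = [comment_url % aid for aid in article_ids]

print(urls[0])
# http://coral.qq.com/article/1164818577/comment?commentid=0&reqnum=20
```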
4. Finding the Comments
Video page:
http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html
Comment endpoint template:
http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20
Endpoint mapping a video id to its comment id:
http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback=
Create the project and the spider:

```shell
scrapy startproject qqvideo
cd qqvideo
scrapy genspider videocomment ""
```
Edit items.py:

```python
import scrapy


class QqvideoItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
```
Write the spider file:
The video comments are loaded by JS, so they cannot be found directly in the page HTML. Therefore:
1. Open: http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html
2. Inspect element, open the Network tab, refresh the comments, then check the JS requests one by one, looking at each response for comment content.
3. In this JS request, the comment-related content was found: https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6014751539826595774&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=9&_=1538742461708
Open the above address (right-click → open link in new tab).
The response of this JS request is shown below; its `content` fields are the comments:
```
_varticle1164818577commentv2({"errCode":0,"data":
  {"targetid":1164818577,"oritotal":34,"orireqnum":10,"oriretnum":10,
   "first":"6014058098125775528","last":"6014042988833472860","hasnext":true,
   "oriCommList":[
     {"targetid":"1164818577","parent":"0","time":"1433863186","userid":"194169133",
      "content":"\u7b49\u4e86","up":"26","pokenum":"0","type":1,"repnum":1,
      "checkhotscale":"0","checktype":"1","checkstatus":"1","isdeleted":0,
      "custom":"","puserid":"0","orireplynum":"1","voteid":0,"guessid":0,
      "richtype":"0","rootid":"0","thirdid":"","id":"6014058098125775528",
      "indexscore":12366}, ...
```
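Note that this response is JSONP: the JSON body is wrapped in a callback such as `_varticle1164818577commentv2(...)`, so it cannot be fed to `json.loads` directly. A minimal sketch of stripping the wrapper, using a shortened sample of the response above:

```python
import json
import re

# Shortened sample of the JSONP response captured above
raw = ('_varticle1164818577commentv2({"errCode":0,"data":'
       '{"targetid":1164818577,"last":"6014042988833472860","hasnext":true,'
       '"oriCommList":[{"content":"\\u7b49\\u4e86","time":"1433863186"}]}})')

# Strip the callback wrapper: keep everything between the first "(" and the last ")"
payload = re.search(r'\((.*)\)\s*$', raw, re.S).group(1)
data = json.loads(payload)["data"]

print(data["oriCommList"][0]["content"])  # 等了
```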
To fetch the comments, only part of this URL is needed:
https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&reqnum=10
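Trimming the captured URL down to its essential query parameters can be done with `urllib.parse`. Which parameters are actually required is an assumption based on the post's testing; here the filter keeps only the callback and the result-count parameters:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Full URL as captured from the browser's Network tab
captured = ("https://video.coral.qq.com/varticle/1164818577/comment/v2"
            "?callback=_varticle1164818577commentv2&orinum=10&oriorder=o&pageflag=1"
            "&cursor=6014751539826595774&scorecursor=0&orirepnum=2&reporder=o"
            "&reppageflag=1&source=9&_=1538742461708")

# Assumed-essential parameters; everything else is dropped
keep = {"callback", "reqnum", "orinum"}
parts = urlsplit(captured)
query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k in keep])
trimmed = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(trimmed)
```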
4. When paging, requests are keyed on an actual comment id; so to turn pages we still need to find that comment id and splice it into the URL.
5. Where does the comment id come from? From the last comment of the current page.
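Steps 4–5 can be sketched as follows: after parsing one page, the `last` field (the id of the last comment on the page) becomes the cursor of the next request, until `hasnext` turns false. The base URL pattern below is assumed from the request captured earlier:

```python
# Base URL assumed from the v2 request captured in the browser
base = "https://video.coral.qq.com/varticle/1164818577/comment/v2?orinum=10"

def next_page_url(data):
    """Return the URL of the next comment page, or None when exhausted.

    `data` is the parsed "data" object of a comment response, which carries
    `hasnext` and `last` (the id of the last comment on this page).
    """
    if not data.get("hasnext"):
        return None
    return base + "&cursor=" + data["last"]

page = {"hasnext": True, "last": "6014042988833472860"}
print(next_page_url(page))
```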
```python
# -*- coding: utf-8 -*-
import json
import re

import scrapy

from qqvideo.items import QqvideoItem


class VideocommentSpider(scrapy.Spider):
    name = "videocomment"
    start_urls = ['http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html']
    comment_url = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"
    sns_url = "http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback="

    def parse(self, response):
        # Extract the video id embedded in the page's JS
        vid = re.search('vid:"(.*?)"', response.text, re.S).group(1)
        yield scrapy.Request(self.sns_url + vid, callback=self.parse_id)

    def parse_id(self, response):
        # The sns endpoint maps the video id to the comment article id
        comment_id = re.search('"comment_id":"(.*?)"', response.text, re.S).group(1)
        yield scrapy.Request(self.comment_url % comment_id,
                             callback=self.parse_comment)

    def parse_comment(self, response):
        js_dict = json.loads(response.text)
        for each in js_dict['data']['commentid']:
            item = QqvideoItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item
```
Write pipelines.py:
Before that, add the configuration to settings.py.
Enable the pipeline (ITEM_PIPELINES is a dict mapping pipeline path to priority):
```python
ITEM_PIPELINES = {'qqvideo.pipelines.QqvideoPipeline': 300}
```
Configure MongoDB:

```python
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'spiders'
MONGODB_DOCNAME = "qqvideocomment"
```
```python
import pymongo
from scrapy.conf import settings  # on newer Scrapy, read crawler.settings instead


class QqvideoPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        table_name = settings['MONGODB_DOCNAME']
        client = pymongo.MongoClient(host=host, port=port)
        self.table = client[db_name][table_name]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert()
        self.table.insert_one(dict(item))
        return item
```
posted on 2018-10-05 19:05 by myworldworld