Scrapy Usage, Part 6: Crawling Dynamic Web Pages
- Introduction to AJAX and what AJAX pages look like
- Reading content from JS files
- Constructing target URLs
- Tencent Video comment spider
1. Introduction to AJAX
AJAX: Asynchronous JavaScript and XML, a web development technique for building interactive web applications.
By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously. This means parts of a page can be updated without reloading the whole page.
Characteristics of AJAX pages:
- fast page loads
- information updates without a page refresh
- the page source differs from the rendered content
2. Reading Content from JS Files
- list the JS files with the browser's element inspector
- look for candidate (suspicious) files
- parse the content of the JS file
3. Constructing Target URLs
- build them from a pattern
- read them from a file
- generate them by hand
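The first approach above, building URLs from a pattern, can be sketched in a few lines. This uses the comment-URL template that appears later in the post; the article ids here are hypothetical placeholders, not real ones:

```python
# Build target URLs by filling a known pattern.
comment_url = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"

# Hypothetical article ids; in practice they could be read from a file
# or generated by hand (the other two approaches above).
article_ids = ["1164818577", "1164818578"]
urls = [comment_url % aid for aid in article_ids]

print(urls[0])
# http://coral.qq.com/article/1164818577/comment?commentid=0&reqnum=20
```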
4. Finding the Comments
Video page:
http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html
Comment endpoint template:
http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20
Endpoint mapping a video id to its comment id:
http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback=
Create the project and the spider:

```shell
scrapy startproject qqvideo
cd qqvideo
scrapy genspider videocomment ""
```
Edit items.py:

```python
import scrapy


class QqvideoItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
```
Write the spider file:
The video comments are loaded by JS, so they cannot be found directly in the page HTML. Therefore:
1. Open: http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html
2. Inspect element, open the Network tab, refresh the comments, then check the JS requests one by one, looking at each response for comment content.
3. In this JS request, the comment-related content was found: https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6014751539826595774&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=9&_=1538742461708
Open the above address (right-click → open link in new tab).
The response of this JS request is shown below; its `content` fields are the comments:
```
_varticle1164818577commentv2({"errCode":0,"data":
  {"targetid":1164818577,"oritotal":34,"orireqnum":10,"oriretnum":10,
   "first":"6014058098125775528","last":"6014042988833472860","hasnext":true,
   "oriCommList":[
     {"targetid":"1164818577","parent":"0","time":"1433863186","userid":"194169133",
      "content":"\u7b49\u4e86","up":"26","pokenum":"0","type":1,"repnum":1,
      "checkhotscale":"0","checktype":"1","checkstatus":"1","isdeleted":0,
      "custom":"","puserid":"0","orireplynum":"1","voteid":0,"guessid":0,
      "richtype":"0","rootid":"0","thirdid":"","id":"6014058098125775528",
      "indexscore":12366}, ...
```
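Note that this response is JSONP: the JSON body is wrapped in a callback such as `_varticle1164818577commentv2(...)`, so it cannot be fed to `json.loads` directly. A minimal sketch of stripping the wrapper, using a shortened sample of the response above:

```python
import json
import re

# Shortened sample of the JSONP response captured above
raw = ('_varticle1164818577commentv2({"errCode":0,"data":'
       '{"targetid":1164818577,"last":"6014042988833472860","hasnext":true,'
       '"oriCommList":[{"content":"\\u7b49\\u4e86","time":"1433863186"}]}})')

# Strip the callback wrapper: keep everything between the first "(" and the last ")"
payload = re.search(r'\((.*)\)\s*$', raw, re.S).group(1)
data = json.loads(payload)["data"]

print(data["oriCommList"][0]["content"])  # 等了
```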
To fetch the comments, only part of this URL is needed:
https://video.coral.qq.com/varticle/1164818577/comment/v2?callback=_varticle1164818577commentv2&reqnum=10
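Trimming the captured URL down to its essential query parameters can be done with `urllib.parse`. Which parameters are actually required is an assumption based on the post's testing; here the filter keeps only the callback and the result-count parameters:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Full URL as captured from the browser's Network tab
captured = ("https://video.coral.qq.com/varticle/1164818577/comment/v2"
            "?callback=_varticle1164818577commentv2&orinum=10&oriorder=o&pageflag=1"
            "&cursor=6014751539826595774&scorecursor=0&orirepnum=2&reporder=o"
            "&reppageflag=1&source=9&_=1538742461708")

# Assumed-essential parameters; everything else is dropped
keep = {"callback", "reqnum", "orinum"}
parts = urlsplit(captured)
query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k in keep])
trimmed = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(trimmed)
```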
4. When paging, requests are keyed on an actual comment id; so to turn pages we still need to find that comment id and splice it into the URL.
5. Where does the comment id come from? From the last comment of the current page.
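Steps 4–5 can be sketched as follows: after parsing one page, the `last` field (the id of the last comment on the page) becomes the cursor of the next request, until `hasnext` turns false. The base URL pattern below is assumed from the request captured earlier:

```python
# Base URL assumed from the v2 request captured in the browser
base = "https://video.coral.qq.com/varticle/1164818577/comment/v2?orinum=10"

def next_page_url(data):
    """Return the URL of the next comment page, or None when exhausted.

    `data` is the parsed "data" object of a comment response, which carries
    `hasnext` and `last` (the id of the last comment on this page).
    """
    if not data.get("hasnext"):
        return None
    return base + "&cursor=" + data["last"]

page = {"hasnext": True, "last": "6014042988833472860"}
print(next_page_url(page))
```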
```python
# -*- coding: utf-8 -*-
import json
import re

import scrapy

from qqvideo.items import QqvideoItem


class VideocommentSpider(scrapy.Spider):
    name = "videocomment"
    start_urls = ['http://v.qq.com/cover/q/qviv9yyjn83eyfu/v0016f0gulh.html']
    comment_url = "http://coral.qq.com/article/%s/comment?commentid=0&reqnum=20"
    sns_url = "http://sns.video.qq.com/fcgi-bin/video_comment_id?otype=json&callback="

    def parse(self, response):
        # Extract the video id embedded in the page's JS
        vid = re.search('vid:"(.*?)"', response.text, re.S).group(1)
        yield scrapy.Request(self.sns_url + vid, callback=self.parse_id)

    def parse_id(self, response):
        # The sns endpoint maps the video id to the comment article id
        comment_id = re.search('"comment_id":"(.*?)"', response.text, re.S).group(1)
        yield scrapy.Request(self.comment_url % comment_id,
                             callback=self.parse_comment)

    def parse_comment(self, response):
        js_dict = json.loads(response.text)
        for each in js_dict['data']['commentid']:
            item = QqvideoItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item
```
Write pipelines.py:
Before that, add the configuration to settings.py.
Enable the pipeline (ITEM_PIPELINES is a dict mapping pipeline path to priority):
```python
ITEM_PIPELINES = {'qqvideo.pipelines.QqvideoPipeline': 300}
```
Configure MongoDB:

```python
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'spiders'
MONGODB_DOCNAME = "qqvideocomment"
```
```python
import pymongo
from scrapy.conf import settings  # on newer Scrapy, read crawler.settings instead


class QqvideoPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        table_name = settings['MONGODB_DOCNAME']
        client = pymongo.MongoClient(host=host, port=port)
        self.table = client[db_name][table_name]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert()
        self.table.insert_one(dict(item))
        return item
```
posted on 2018-10-05 19:05 by myworldworld