Web Scraping Final Project

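Any discussion of standout low-budget Chinese films today has to mention Dying to Survive (我不是药神). This project scrapes the film's short comments from Douban and uses them to analyze how the film was received.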

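Background reading turned up the following:

Since October 2017 Douban has broadly restricted scraping and exposes only 500 comments per film. Reportedly about 40 requests per minute are tolerated during the day and about 60 per minute at night; going over that gets the IP address banned.

So both the amount of data fetched and the request rate have to be kept under control.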
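One way to enforce a per-minute budget like the one described above is a rolling-window limiter. The following is a minimal sketch, not the approach used in this project (which simply sleeps between requests); the class name `RateLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the logic can be checked without waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any rolling `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the rolling window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call expires, then drop it.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

In a crawler this would look like `limiter = RateLimiter(40)` once, then `limiter.wait()` before each `requests.get(...)`. Choosing a budget somewhat below the reported limit leaves a safety margin.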

Log in and obtain the cookie

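    import requests

    # Fragment of the per-page fetch routine: `id` is the Douban subject id, `page` the zero-based page index.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        # Send the logged-in cookie as one raw header; cookies={'cookie': ...} would define a single cookie literally named "cookie".
        # Replace the value with the cookie from your own logged-in session (header values must be Latin-1 encodable).
        'Cookie': 'bid=GOOb4vXwNcc; douban-fav-remind=1; viewed="27611266_26886337"; ps=y; ue="citpys原创分享@163.com"; '
                  'push_noty_num=0; push_doumail_num=0; ap=1; loc-last-index-location-id="108288"; ll="108288"; '
                  'dbcl2="187285881:N/y1wyPpmA8"; ck=4wlL',
    }
    # `start` advances by the page size (limit=20) so that consecutive pages do not overlap.
    url = "https://movie.douban.com/subject/" + str(id) + "/comments?start=" + str(page * 20) + "&limit=20&sort=new_score&status=P"
    res = requests.get(url, headers=headers)
    res.encoding = "utf-8"
    if res.status_code == 200:
        print("\nPage {} of short comments fetched successfully!".format(page + 1))
        print(url)
    else:
        print("\nPage {} failed to fetch!".format(page + 1))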
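The URL in the fragment above combines a `start` offset with a `limit` page size. As a small sketch of that paging arithmetic, assuming `start` is the index of the first comment on the page and `limit` the number of comments per page, the offset should grow by `limit` per page, otherwise adjacent pages overlap. The helper name `comment_url` and the subject id below are placeholders, not values from this project.

```python
from urllib.parse import urlencode

SUBJECT_ID = 12345678  # placeholder id, not a real film


def comment_url(subject_id, page, limit=20):
    """Build the short-comment URL for a zero-based page index."""
    query = urlencode({
        "start": page * limit,   # offset of the first comment on this page
        "limit": limit,
        "sort": "new_score",
        "status": "P",
    })
    return "https://movie.douban.com/subject/{}/comments?{}".format(subject_id, query)


# With a cap of 500 comments, 25 pages of 20 comments cover everything available.
urls = [comment_url(SUBJECT_ID, p) for p in range(500 // 20)]
```

Building the query with `urlencode` keeps the parameters in one place, so changing the page size cannot silently desynchronize `start` and `limit`.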

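Set a delay between requests

    # Pause for a random 1 to 2 seconds (rounded to 2 decimals) so requests stay under the rate limit.
    time.sleep(round(random.uniform(1, 2), 2))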

Extract the required fields

    name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
    # Known pitfall: a user can comment without giving a rating. In that case span[2] holds the date,
    # so `score` parses as a date, and span[3] (where the date normally sits) is empty.
    score = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[2]/@title'.format(i))
    date = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[3]/@title'.format(i))
    if not score:
        score = ["null"]  # neither a rating nor a date was found; avoid an IndexError below
    elif re.match(r'\d{4}-\d{2}-\d{2}', score[0]):
        date = score
        score = ["null"]
    content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/span/text()'.format(i))
    id = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/@href'.format(i))
    try:
        city = get_city(id[0], i)  # look up the commenter's city from their profile link
    except IndexError:
        city = " "
    name_list.append(str(name[0]))
    score_list.append(str(score[0]).strip('[]\''))  # "null" when the user commented without rating
    date_list.append(str(date[0]).strip('[\'').split(' ')[0])  # keep only the date part
    content_list.append(str(content[0]).strip())
    city_list.append(city)
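The missing-rating case handled above can be isolated into a small pure function, which makes it easy to check without fetching any page. This is a sketch under assumptions: the function name `split_score_and_date` is hypothetical, each argument is the list an XPath query returned (possibly empty), and it returns `None` for a missing value rather than the string "null" used in this project.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def split_score_and_date(first_title, second_title):
    """Return (score, date) from the two candidate `title` attribute lists."""
    first = first_title[0].strip() if first_title else ""
    second = second_title[0].strip() if second_title else ""
    if DATE_RE.match(first):
        # No rating was given: the date slid into the first slot.
        return None, first.split(" ")[0]
    return (first or None), (second.split(" ")[0] or None)


# Illustrative values only, not scraped data.
rated = split_score_and_date(["力荐"], ["2018-07-05 20:13:22"])
unrated = split_score_and_date(["2018-07-05 20:13:22"], [])
```

Returning `None` keeps "no rating" distinguishable from any real rating text when the data is later loaded for analysis.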

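Get the city information

    # This fragment pulls the movie title out of the page heading (<h1>... 短评</h1>).
    pattern = re.compile('<div id="wrapper">.*?<div id="content">.*?<h1>(.*?) 短评</h1>', re.S)
    global movie_name
    movie_name = re.findall(pattern, res.text)[0]  # findall returns a list; take the first match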
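Indexing `re.findall(...)[0]` raises an `IndexError` whenever the heading is missing, for example if the response is a login or block page instead of the comment list. A slightly more defensive sketch, using the same pattern but a hypothetical helper name `extract_movie_name` and a made-up HTML sample, returns `None` in that case and avoids the global variable:

```python
import re

TITLE_RE = re.compile(
    r'<div id="wrapper">.*?<div id="content">.*?<h1>(.*?) 短评</h1>', re.S
)


def extract_movie_name(html):
    """Return the title that precedes ' 短评' in the page heading, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


# Made-up sample that mimics the structure the pattern expects.
sample = '<div id="wrapper">\n<div id="content">\n<h1>我不是药神 短评</h1></div></div>'
title = extract_movie_name(sample)
```

The caller can then decide what to do when the title is `None`, such as stopping the crawl.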

Save the data to CSV

    infos = {'name': name_list, 'city': city_list, 'content': content_list, 'score': score_list, 'date': date_list}
    data = pd.DataFrame(infos, columns=['name', 'city', 'content', 'score', 'date'])
    data.to_csv(str(ID) + "_comments.csv")
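For comparison, the same five columns can be written with the standard library's `csv` module instead of pandas. This is only a sketch of that alternative, not what this project does; the helper name `write_comments` and the sample row are made up.

```python
import csv
import io

FIELDS = ["name", "city", "content", "score", "date"]


def write_comments(rows, out):
    """Write an iterable of dicts keyed by FIELDS to the text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; a comma inside the comment text gets quoted automatically.
rows = [{"name": "user_a", "city": "北京", "content": "好看, 值得一看",
         "score": "力荐", "date": "2018-07-05"}]
buffer = io.StringIO()
write_comments(rows, buffer)

# Writing to disk would use a real file instead of the in-memory buffer:
# with open("comments.csv", "w", newline="", encoding="utf-8-sig") as f:
#     write_comments(rows, f)
```

The `utf-8-sig` encoding in the commented-out lines adds a byte order mark, which helps spreadsheet programs such as Excel detect UTF-8 when the file contains Chinese text.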

Data visualization

    # pyecharts 0.5.x-style API. `attr` holds the city names and `val` the comment count per city.
    # Dot map of where comments come from.
    geo1 = Geo("", "评论城市分布", title_pos="center", width=1200, height=600,
               background_color='#404a59', title_color="#fff")
    geo1.add("", attr, val, visual_range=[0, 300], visual_text_color="#fff", is_geo_effect_show=False,
             is_piecewise=True, visual_split_number=10, symbol_size=15, is_visualmap=True, is_more_utils=True)
    # geo1.render(csv_file + "_城市dotmap.html")
    page.add_chart(geo1)
    # Heat map of comment sources.
    geo2 = Geo("", "评论来源热力图", title_pos="center", width=1200, height=600,
               background_color='#404a59', title_color="#fff")
    geo2.add("", attr, val, type="heatmap", is_visualmap=True, visual_range=[0, 50],
             visual_text_color='#fff', is_more_utils=True)
    # geo2.render(csv_file + "_城市heatmap.html")  # take the first 8 digits of the CSV file name
    page.add_chart(geo2)
    # Bar chart ranking cities by comment count.
    bar = Bar("", "评论来源排行", title_pos="center", width=1200, height=600)
    bar.add("", attr, val, is_visualmap=True, visual_range=[0, 100], visual_text_color='#fff',
            mark_point=["average"], mark_line=["average"],
            is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    # bar.render(csv_file + "_城市评论bar.html")  # take the first 8 digits of the CSV file name
    page.add_chart(bar)
    # Pie chart of comment sources.
    pie = Pie("", "评论来源饼图", title_pos="right", width=1200, height=600)
    pie.add("", attr, val, radius=[20, 50], label_text_color=None, is_label_show=True,
            legend_orient='vertical', is_more_utils=True, legend_pos='left')
    # pie.render(csv_file + "_城市评论Pie.html")  # take the first 8 digits of the CSV file name
    page.add_chart(pie)
    page.render(csv_file + "_城市评论分析汇总.html")
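The charts above take two parallel lists, `attr` and `val`, whose construction is not shown. One plausible way to derive them from the saved CSV is to count how often each city appears. This is a sketch under that assumption; the helper name `city_counts` and the sample data are made up.

```python
import csv
import io
from collections import Counter


def city_counts(csv_text, top_n=None):
    """Return (attr, val): city names and their comment counts, most frequent first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows whose city is blank (the scraper stores " " when no city is found).
    counts = Counter(
        row["city"].strip() for row in reader if row["city"].strip()
    )
    pairs = counts.most_common(top_n)
    attr = [city for city, _ in pairs]
    val = [count for _, count in pairs]
    return attr, val


# Made-up sample: two comments from 北京, one from 上海, one with no city.
sample_csv = "name,city\na,北京\nb,上海\nc,北京\nd, \n"
attr, val = city_counts(sample_csv)
```

City names that the map charts cannot place would still need to be filtered out separately before plotting.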

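Comment distribution dot map

The map shows that first- and second-tier cities such as Beijing, Shanghai, Nanjing, Hangzhou, Shenzhen, and Guangzhou account for the most comments, which suggests they also had the largest audiences.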

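According to Maslow's hierarchy of needs, once people's basic lower-level needs are met, they begin to pursue higher, non-material needs. How lively a city's film market is often reflects how prosperous the city itself is. First-tier cities also have more screens in total than second- and third-tier cities. With their large populations, the abundance of screens and seats and the dense screening schedules make it easy for many people to see a film, so box office there is naturally much higher than elsewhere, and active audience comments are more numerous too.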

Review word cloud

The word cloud is shaped using an image of Xu Zheng.

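High-frequency words

中国 (China)
题材 (subject matter)
现实 (reality)
煽情 (sentimental)
社会 (society)
故事 (story)
好看 (enjoyable)
希望 (hope)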
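A list of high-frequency words like the one above can be produced by counting tokens after removing filler words. The sketch below assumes the comments have already been segmented into words (Chinese text needs a word segmenter first, which is not shown here); the helper name `top_words`, the tiny stop list, and the token sample are all illustrative.

```python
from collections import Counter

STOP_WORDS = {"的", "了", "是", "电影", "一部"}   # tiny illustrative stop list


def top_words(tokens, n=8, stop_words=STOP_WORDS):
    """Return the n most frequent tokens, skipping stop words and single characters."""
    kept = [t for t in tokens if len(t) > 1 and t not in stop_words]
    return Counter(kept).most_common(n)


# Made-up token sample, not real comment data.
tokens = ["中国", "的", "现实", "题材", "电影", "中国", "希望", "现实", "中国"]
top_two = top_words(tokens, n=2)
```

Dropping single-character tokens is a common heuristic for Chinese text, since most of them are particles that carry little meaning on their own.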

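It is not hard to see that this is a successful Chinese film, one with distinctly Chinese characteristics that reflects current social conditions.

As the wealth gap widens and medical technology keeps improving, money can do more than make the devil turn the millstone: it can rewrite the book of life and death.

One day, people will be unequal even in the face of death!

Meanwhile, the circumstances of the poor will keep getting worse. Their bodies are exhausted, their inner lives are impoverished, and their mood is anxious. They fall ill more easily than the rich and die more easily too. Because they are poor, they cannot improve their health outside of work, and when a serious illness strikes, in all likelihood they can only wait to die. They also pass their poverty on to the next generation.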

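We can see that Chinese audiences, and Chinese society more broadly, are beginning to pay attention to whether the system itself is reasonable.

The film's greater value is that it washes away the stereotype that a domestic film equals a bad film. "Enjoyable" (好看) is the audience's honest feedback, and it lets people see hope for the rise of domestic cinema.

All in all, this is a domestic film well worth two hours of your time.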

posted @ 2019-04-28 21:07 天安永龙
