[python3] Scraping Jianshu comments to generate a word cloud

1. Background:

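     Yesterday I came across an article on Jianshu titled 《中国的父母,大都有毛病》 ("Most Chinese Parents Have Something Wrong with Them"). After reading it, I largely agreed with the author.

     Scrolling through the comments, though, I found them highly contentious and mostly split into two camps. Curious about what the comments looked like as a whole, I wrote a scraper and built a word cloud from them.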

2. Approach:

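     ① Inspect the page, find the request that loads the comments, check the format of the returned comment data, and write the scraper

     ② Use the jieba module to segment the scraped comments into words

     ③ Use the wordcloud module to generate the word cloud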
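As a rough illustration of step ①, the sketch below parses a response of the shape the scraper expects and strips the HTML from each comment. The keys `total_pages`, `comments` and `compiled_content` are the ones the script in this post reads. The sample payload is made up, and the standard-library `html.parser` stands in for BeautifulSoup so the snippet runs without extra packages.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the plain text of an HTML fragment (tags dropped)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


# Hypothetical payload: only the key names come from the scraper in this post.
SAMPLE_RESPONSE = json.dumps({
    'total_pages': 3,
    'comments': [
        {'compiled_content': 'I agree with <b>the author</b>.'},
        {'compiled_content': 'Not every parent is like this.<br>'},
    ],
})


def parse_comments(raw):
    """Return (total_pages, list of plain-text comments) from a JSON string."""
    data = json.loads(raw)
    texts = [html_to_text(c['compiled_content']) for c in data['comments']]
    return data['total_pages'], texts


total_pages, comments = parse_comments(SAMPLE_RESPONSE)
```

The page count returned alongside the comments is what lets a caller decide how many more pages to request.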

3. The code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests,json,time
import jieba
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator

# Append one scraped comment to the results file
def write(path,text):
    with open(path,'a', encoding='utf-8') as f:
        f.write(text)  # text is a single string, so write() is enough
        f.write('\n')

# Scrape one page of comments and return the total page count reported by the server
def getcomments(num,path):
    url = 'https://www.jianshu.com/notes/23437010/comments?comment_id=&author_only=false&since_id=0&max_id=1586510606000&order_by=likes_count&page='+str(num)
    response = requests.get(url).text
    response = json.loads(response)
    total_pages = response['total_pages']
    for i in response['comments']:
        # compiled_content is HTML; keep only its text
        comment = BeautifulSoup(i['compiled_content'],'lxml').text
        write(path,comment)
    return total_pages

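# Segment the saved comments into words with jieba
def read(path):
    text=''
    with open(path, encoding='utf-8') as s:
        for line in s:
            line = line.strip()  # strip() returns a new string, so keep the result
            # the trailing space keeps the last word of this line apart from the next line
            text += ' '.join(jieba.cut(line)) + ' '
    return text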

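# Generate the word cloud with WordCloud
# Note: this reads the module-level variable `text` assigned in the main block
def wordcloud(imagepath):
    background_image = plt.imread(imagepath)
    wc = WordCloud(background_color='white',  # background color
                   mask=background_image,  # image used as the mask (shape) of the cloud
                   max_words=2000,  # maximum number of words to display
                   stopwords=STOPWORDS,  # stop words to exclude
                   font_path='C:/Users/Windows/fonts/msyh.ttf',  # font file; without a font that covers Chinese, Chinese text will not render
                   max_font_size=120,  # largest font size
                   random_state=30,  # seed for the random generator, so the layout and colors are reproducible
                   )
    wc.generate(text)
    image_colors = ImageColorGenerator(background_image)  # take the colors from the mask image
    wc.recolor(color_func=image_colors)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()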
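if __name__ == '__main__':
    path = '评论.txt' # path of the file that stores the comments
    imagepath = 'heart.jpg' # path of the word cloud background image
    print('Scraping comments')
    i,num=1,2
    while i <= num:
        num=getcomments(i,path) # scrape page i; num becomes the real page count
        time.sleep(2)
        i += 1
    print('Segmenting words')
    text = read(path)  # segment with jieba
    print('Generating word cloud')
    wordcloud(imagepath) # generate the word cloud with WordCloud
    print('Word cloud generated')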
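Before rendering, it can help to check which words will dominate the cloud. The sketch below counts tokens in a space-separated string like the one `read()` returns. The sample text and the `CHINESE_STOPWORDS` set are illustrative assumptions, not part of the original script. As far as I know, wordcloud's built-in `STOPWORDS` only lists English words, so Chinese function words such as 的 need a list of your own.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
CHINESE_STOPWORDS = {'的', '了', '是', '我', '也'}


def top_words(segmented_text, stopwords, n=2):
    """Return the n most common tokens that are not stop words."""
    tokens = [t for t in segmented_text.split() if t not in stopwords]
    return Counter(tokens).most_common(n)


# Made-up, already segmented sample text (tokens separated by spaces).
sample = '父母 的 爱 是 无私 的 父母 也 会 犯错 孩子 理解 父母 孩子'
result = top_words(sample, CHINESE_STOPWORDS, n=2)
```

A set like this could also be passed as `stopwords=` to `WordCloud` in place of `STOPWORDS`, so that the same words are left out of the picture.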

Result:

posted @ 2018-02-09 14:32  TurboWay