Chinese Word Frequency Statistics and Word Cloud Generation

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822

Chinese Word Frequency Statistics

1. Download a full-length Chinese novel.

2. Read the text to be analyzed from a file.
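As a minimal sketch of step 2, the helper below reads a whole novel into one string. The function name `read_text` and the UTF-8 default are assumptions for illustration; a file saved in another encoding (GBK is common for Chinese text files on Windows) would need a different `encoding` argument.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Return the file's contents as one string with line breaks removed."""
    text = Path(path).read_text(encoding=encoding)
    # Join the lines so that hard line breaks do not end up as tokens.
    return text.replace("\r", "").replace("\n", "")
```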

3. Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(text)

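4. Update the dictionary by adding the specialized vocabulary of the text being analyzed.

jieba.add_word('天罡北斗阵')  # add words one at a time

jieba.load_userdict(word_dict)  # word_dict: path to a dictionary text file

Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/

Conversion code (Sogou .scel dictionary to plain text): scel_to_text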

5. Count word frequencies.

6. Sort by frequency.
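Steps 5 and 6 can be sketched with the standard library's `collections.Counter`, which counts tokens and, through `most_common`, returns them ordered by descending count. The token list below is made-up sample data standing in for the output of `jieba.lcut`.

```python
from collections import Counter

# Sample tokens standing in for a segmented novel (illustrative data only).
tokens = ["段誉", "说道", "萧峰", "段誉", "说道", "段誉"]

# Step 5: count how many times each token occurs.
frequency = Counter(tokens)

# Step 6: most_common() returns (token, count) pairs, highest count first.
ranked = frequency.most_common()
print(ranked)
```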

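7. Exclude grammatical function words, that is, stop words such as pronouns, articles, and conjunctions.

stops  # the collection of stop words

tokens = [token for token in wordsls if token not in stops]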
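A sketch of step 7, assuming the stop-word list is a UTF-8 text file with one word per line. The helper names `load_stopwords` and `remove_stopwords` and the sample data are illustrative. Keeping the stop words in a `set` makes each `not in stops` test constant-time on average, which matters when a novel yields hundreds of thousands of tokens.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stops):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stops]


# Demonstration with a throwaway stop-word file (illustrative contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    stops_path = os.path.join(tmp_dir, "stops.txt")
    with open(stops_path, "w", encoding="utf-8") as f:
        f.write("的\n了\n\n他\n")
    stops = load_stopwords(stops_path)

wordsls = ["他", "看见", "了", " ", "段誉", "的", "折扇"]
tokens = remove_stopwords(wordsls, stops)
print(tokens)
```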

8. Output the 20 most frequent words (top 20) and save the result to a file.
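One way to do step 8 is to write each word with its count on its own line, so the saved file still carries the frequencies. The function name `save_top_n`, the tab-separated layout, and the sample counts are assumptions for this sketch, not requirements of the assignment.

```python
import os
import tempfile
from collections import Counter


def save_top_n(frequency, path, n=20):
    """Write the n most frequent words to path as 'word<TAB>count' lines."""
    top = Counter(frequency).most_common(n)
    with open(path, "w", encoding="utf-8") as f:
        for word, count in top:
            f.write(f"{word}\t{count}\n")
    return top


# Illustrative counts; real values would come from the segmented novel.
sample = {"段誉": 3, "萧峰": 5, "虚竹": 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "result.txt")
    top = save_top_n(sample, out_path, n=2)
    with open(out_path, "r", encoding="utf-8") as f:
        saved = f.read()
print(saved)
```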

9. Generate the word cloud.

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

result_path = r'C:\Users\LJ\Desktop\wordcloud\result.txt'
fiction_path = r'C:\Users\LJ\Desktop\wordcloud\天龙八部.txt'
stops_path = r'C:\Users\LJ\Desktop\wordcloud\stops_chinese.txt'
userdict_path = r'C:\Users\LJ\Desktop\wordcloud\userdict\天龙八部词库.txt'
mask_path = r'C:\Users\LJ\Desktop\mask.jpg'


def save_result():
    # Read the novel and drop the line breaks
    with open(fiction_path, 'r', encoding='utf8') as f:
        fiction = f.read().replace('\n', '')
    # Read the stop words into a set for fast membership tests
    with open(stops_path, 'r', encoding='utf8') as f:
        stops = set(f.read().split('\n'))
    # Load the user-defined dictionary
    jieba.load_userdict(userdict_path)
    # Segment the text; lcut returns a list
    wordlist = jieba.lcut(fiction)
    # Remove the stop words
    wordlist_nostop = [word for word in wordlist if word not in stops]
    # Count word frequencies
    wordfrequency = {}
    for word in wordlist_nostop:
        wordfrequency[word] = wordfrequency.get(word, 0) + 1
    # A dict cannot be sorted in place, so convert its items to a list
    paixu = list(wordfrequency.items())
    # Sort by count, highest first
    paixu.sort(key=lambda x: x[1], reverse=True)
    # Keep the top 20
    paixu = paixu[0:20]
    # Join the words (the keys) into one space-separated string
    result = ' '.join(word for word, count in paixu)
    # Save the top 20
    with open(result_path, 'w', encoding='utf8') as f:
        f.write(result)


def read_result():
    # Read the saved top 20
    with open(result_path, 'r', encoding='utf8') as f:
        return f.read()


save_result()
result = read_result()
# Read the mask image as an array (scipy.misc.imread was removed in SciPy 1.2)
im = np.array(Image.open(mask_path))
# Use im as the mask. WordCloud's default font has no Chinese glyphs;
# pass font_path=<a CJK font file> if the words render as empty boxes.
mywc = WordCloud(background_color='pink', mask=im, margin=1).generate(result)
plt.imshow(mywc)
plt.axis("off")
# Show the word cloud
plt.show()

Results:

Mask image:

Top 20:

Word cloud:

Posted on 2019-03-25 13:42 by Lijiajun
