Word Frequency Statistics and Word Clouds

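This post covers:

1. Installing the third-party libraries

2. Text processing

3. Word frequency counting

4. Visualizing the statistics

5. Rendering the word cloud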

Installing the libraries

The libraries needed are PIL (installed as the Pillow package), wordcloud, numpy, matplotlib, and jieba.
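Before going further, it can help to check which of these libraries are already present. The following is a minimal sketch, assuming the usual pip package names (the PIL module comes from the Pillow package); the helper name missing_packages is made up for illustration.

```python
import importlib.util

# Import name -> name to pass to "pip install" (assumed to be the usual PyPI names).
REQUIRED = {
    "PIL": "Pillow",
    "wordcloud": "wordcloud",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "jieba": "jieba",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Run: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```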

Importing the libraries

from PIL import Image
import wordcloud
import numpy as np
import matplotlib.pyplot as plt
import jieba

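Text processing

First, load the text of 《斗破苍穹》 and segment it into words with jieba.

with open('斗破苍穹.txt', 'r', encoding='utf-8') as fo:
    for i in fo:  # iterating over the file already yields one line per pass
        fo1 = i.strip('\n')
        fo1 = jieba.lcut(fo1)  # list of words for this line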
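A pitfall worth knowing when reading a file line by line: calling readline() inside a "for line in file" loop consumes a second line on every pass, so every other line is silently skipped. The sketch below demonstrates this with an in-memory file (io.StringIO) standing in for the novel; both function names are made up for illustration.

```python
import io


def read_lines_skipping(f):
    """Mixes iteration with readline(): half of the lines are lost."""
    out = []
    for line in f:
        line = f.readline()  # reads the NEXT line, discarding the current one
        out.append(line.strip("\n"))
    return out


def read_lines(f):
    """Plain iteration: every line is seen exactly once."""
    return [line.strip("\n") for line in f]


sample = "line 1\nline 2\nline 3\nline 4\n"
print(read_lines_skipping(io.StringIO(sample)))  # ['line 2', 'line 4']
print(read_lines(io.StringIO(sample)))  # ['line 1', 'line 2', 'line 3', 'line 4']
```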

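After segmentation there are many characters, words, punctuation marks, and numbers we do not want, so the next step is to bring in a stop word list.

Creating the stop word file

Search online for "中文停用词" (Chinese stop words) to find a suitable list.

Copy it into a new file named 停用词.txt and save it. The stop word file is now ready.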

Loading the stop words

with open('停用词.txt', 'r', encoding='utf-8') as fx:
    a, b = fx.readlines(), []
    for i in a:
        # The second strip removes the spaces after a stop word;
        # otherwise the stop word will not match.
        i = i.strip('\n').strip(' ')
        b.append(i)
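As a possible variation, the stop words can be kept in a set, which makes the later "is this word a stop word" check fast even for a long list. This is only a sketch: load_stopwords is a hypothetical helper, and an in-memory file stands in for 停用词.txt.

```python
import io


def load_stopwords(f):
    """Return the non-empty lines of f as a set, with surrounding whitespace removed."""
    # str.strip() with no argument removes spaces, tabs, and line endings alike.
    return {line.strip() for line in f if line.strip()}


sample = io.StringIO("的 \n了\n\n  我们\n")
stopwords = load_stopwords(sample)
print(sorted(stopwords))
```

With a real file this would be called as load_stopwords(open('停用词.txt', encoding='utf-8')). If the file was saved with a byte order mark, opening it with encoding='utf-8-sig' keeps the mark from sticking to the first stop word.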

Filtering with the stop words

with open('斗破苍穹.txt', 'r', encoding='utf-8') as fo:
    c = []  # words that survive the filtering
    for i in fo:
        fo1 = i.strip('\n')
        fo1 = jieba.lcut(fo1)
        for j in fo1:
            if len(j) != 1:  # drop single-character tokens
                if j not in b:  # drop stop words
                    c.append(j)
# c now holds the filtered word list
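The filtering step can also be sketched as a small generator that takes the tokenizer as a parameter. In the real pipeline the tokenizer would be jieba.lcut; here str.split stands in so the example runs without jieba. The name iter_words and the sample data are assumptions made for illustration.

```python
def iter_words(lines, tokenize, stopwords):
    """Yield tokens longer than one character that are not stop words."""
    for line in lines:
        for token in tokenize(line.strip()):
            if len(token) > 1 and token not in stopwords:
                yield token


demo_lines = ["the cat sat on a mat\n", "a dog and the cat\n"]
demo_stopwords = {"the", "on", "and"}
words = list(iter_words(demo_lines, str.split, demo_stopwords))
print(words)  # ['cat', 'sat', 'mat', 'dog', 'cat']
```

Passing the tokenizer in keeps the filtering logic easy to try out on small inputs before running it on the whole novel.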

Note: save all the text files in UTF-8 to avoid unnecessary trouble later (this one gave me a real headache).

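Word frequency counting

The idea is to split the result of the count into two parts, the words and their frequencies, and store each in a list.

def g(n):
    # n is the list of words
    a, c, d = {}, [], []
    for i in n:
        a[i] = a.get(i, 0) + 1  # count occurrences
    b = list(a.items())
    b.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    for e, f in b[:20]:  # top 20; slicing is safe when there are fewer words
        c.append(e)  # add the word
        d.append(f)  # add its frequency
        print('{: <10}{:>10}'.format(e, f))
    return [c, d]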
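The same counting can be written more compactly with collections.Counter from the standard library. This is an alternative sketch rather than the code used in this post; top_words is a hypothetical name and the sample words are invented.

```python
from collections import Counter


def top_words(words, k=20):
    """Return (names, counts) for the k most frequent words."""
    pairs = Counter(words).most_common(k)  # returns fewer pairs if there are fewer words
    names = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    return names, counts


names, counts = top_words(["fire", "flame", "fire", "elder", "flame", "fire"], k=2)
print(names, counts)  # ['fire', 'flame'] [3, 2]
```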

[Figure: the printed list of the most frequent words and their counts]

Visualizing the statistics

Use the pyplot module from the matplotlib library.

def h(n):  # n is [words, frequencies] as returned by g
    a = n
    name_list = a[0]  # words
    print(name_list)
    num_list = a[1]  # frequencies
    plt.bar(range(len(num_list)), num_list, tick_label=name_list, fc='r')
    plt.show()

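The result...

The tick labels turned into empty boxes, and the console filled up with errors.

I looked it up: matplotlib's default font cannot display Chinese, so give it a Chinese font! Insert the following into the function above, before the plt.bar call:

plt.rcParams['font.sans-serif'] = ['simHei']
plt.rcParams['axes.unicode_minus'] = False

Ran it again, and it works.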
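SimHei ships with Windows, so on other systems the setting above may have no effect. One way around this, sketched here under assumptions, is to pick the first font from a candidate list that matplotlib actually knows about. The candidate names are guesses at fonts commonly found on Windows, macOS, and Linux, and both helper names are hypothetical.

```python
# Candidate font names are assumptions; adjust them to whatever is installed locally.
CJK_CANDIDATES = [
    "SimHei",
    "Microsoft YaHei",
    "PingFang SC",
    "Noto Sans CJK SC",
    "WenQuanYi Micro Hei",
]


def pick_font(candidates, installed):
    """Return the first candidate present in installed, or None if there is none."""
    for name in candidates:
        if name in installed:
            return name
    return None


def installed_font_names():
    """Names of the fonts matplotlib has found on this machine."""
    # Imported here so that pick_font can be used without matplotlib.
    from matplotlib import font_manager
    return {entry.name for entry in font_manager.fontManager.ttflist}


print(pick_font(CJK_CANDIDATES, {"Arial", "Noto Sans CJK SC"}))  # Noto Sans CJK SC
```

In the plotting function, the result of pick_font(CJK_CANDIDATES, installed_font_names()) could then be placed in plt.rcParams['font.sans-serif'] when it is not None.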

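Rendering the word cloud

It is a word "cloud", after all, so let's use a cloud as the background shape.

Here is the code. (Note: the larger scale is, the higher the output resolution and the harder the computer has to work; the larger max_words is, the more words appear in the image and the denser it gets.)

def k(n):  # n is the list of words
    a = ' '.join(n)
    mask = np.array(Image.open('云.png'))  # image used as the shape template
    b = wordcloud.WordCloud(font_path='SIMYOU.TTF',
                            scale=30,
                            max_words=6000,
                            mask=mask,
                            height=800,  # ignored when a mask is given
                            width=800,  # ignored when a mask is given
                            background_color='white',
                            repeat=False,
                            mode='RGBA')  # configure the word cloud
    b = b.generate(a)  # fill in the words and generate the cloud
    b.to_file('词云.png')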
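To get a feel for why a large scale is demanding, here is a rough back-of-the-envelope sketch. It assumes the rendered image is the base size multiplied by scale, and that the mask happens to be 800 by 800 pixels (the real size of 云.png is not stated here). Both function names are made up for illustration.

```python
def output_pixels(width, height, scale):
    """Size of the rendered image: base size multiplied by scale."""
    return int(width * scale), int(height * scale)


def rgba_bytes(width, height, scale):
    """Approximate size of the uncompressed RGBA image (4 bytes per pixel)."""
    w, h = output_pixels(width, height, scale)
    return w * h * 4


w, h = output_pixels(800, 800, 30)
print(w, h)  # 24000 24000
print(rgba_bytes(800, 800, 30) / 1e9)  # 2.304 (gigabytes)
```

Under these assumptions a single uncompressed image already takes over 2 GB of memory, so a much smaller scale is usually a more comfortable starting point.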

[Figure: the resulting word cloud]

Here is the full code:

from PIL import Image
import wordcloud
import numpy as np
import matplotlib.pyplot as plt
import jieba

# Text processing
def f():
    with open('停用词.txt', 'r', encoding='utf-8') as fx:
        a, b = fx.readlines(), []
        for i in a:
            # The second strip removes the spaces after a stop word
            i = i.strip('\n').strip(' ')
            b.append(i)

    with open('斗破苍穹.txt', 'r', encoding='utf-8') as fo:
        c = []
        for i in fo:  # iterating over the file already yields one line per pass
            fo1 = i.strip('\n')
            fo1 = jieba.lcut(fo1)
            for j in fo1:
                if len(j) != 1:  # drop single-character tokens
                    if j not in b:  # drop stop words
                        c.append(j)
    return c

def g(n):
    # n is the list of words
    a, c, d = {}, [], []
    for i in n:
        a[i] = a.get(i, 0) + 1  # count occurrences
    b = list(a.items())
    b.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    for e, f in b[:15]:  # top 15; slicing is safe when there are fewer words
        c.append(e)  # add the word
        d.append(f)  # add its frequency
        print('{: <10}{:>10}'.format(e, f))
    return [c, d]

def h(n):  # n is [words, frequencies] as returned by g
    a = n
    plt.rcParams['font.sans-serif'] = ['simHei']  # font that can display Chinese
    plt.rcParams['axes.unicode_minus'] = False
    name_list = a[0]  # words
    print(name_list)
    num_list = a[1]  # frequencies
    plt.bar(range(len(num_list)), num_list, tick_label=name_list, fc='r')
    plt.show()

def k(n):  # n is the list of words
    a = ' '.join(n)
    mask = np.array(Image.open('云.png'))  # image used as the shape template
    b = wordcloud.WordCloud(font_path='SIMYOU.TTF',
                            scale=30,
                            max_words=6000,
                            mask=mask,
                            height=800,  # ignored when a mask is given
                            width=800,  # ignored when a mask is given
                            background_color='white',
                            repeat=False,
                            mode='RGBA')  # configure the word cloud
    b = b.generate(a)  # fill in the words and generate the cloud
    b.to_file('词云.png')  # save the image

a = f()  # build the word list
b = g(a)  # count frequencies; returns two lists
h(b)  # plot the frequencies
k(a)  # render the word cloud

Done!

References:

https://blog.csdn.net/weixin_44301621/article/details/89510319
http://imhuchao.com/1048.html
Text download: https://www.ibiqiuge.com/8298/

posted @ 2021-04-27 02:07 火蝇
