Text clustering with a pre-trained word2vec model
I tried k-means on word-frequency (bag-of-words) document representations and the results were poor, so I wanted to see whether word2vec representations would do any better.
1. Loading the word2vec model
import gensim

# Text-format vectors load with the gensim default; pass binary=True for binary files.
model = gensim.models.KeyedVectors.load_word2vec_format('word2vector.bigram-char')
The file was downloaded from the web; it contains 300-dimensional word vectors trained on a Baidu Baike corpus. A quick look at what it has learned:
model.most_similar(['男人'])
[('女人', 0.874478816986084),
('老男人', 0.7225901484489441),
('大男人', 0.7179129123687744),
('女孩', 0.6780898571014404),
('臭男人', 0.6778838038444519),
('中年男人', 0.6763597726821899),
('男孩', 0.6762259006500244),
('真男人', 0.6674383878707886),
('好男人', 0.6661351919174194),
('单身男人', 0.6624549031257629)]
len(model.vocab)  # 635974
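Individual words map to 300-dimensional numpy arrays. A minimal sketch of the lookup and out-of-vocabulary handling used in the next step (note that model.vocab and model[word] are the gensim 3.x interfaces; gensim 4.x replaces vocab with key_to_index):

import numpy as np

word = '男人'
if word in model.vocab:      # gensim 3.x; use model.key_to_index in 4.x
    vec = model[word]        # 300-dimensional numpy array
    print(vec.shape)         # (300,)
else:
    vec = np.zeros(300)      # fall back to a zero vector for OOV words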
2. Embedding the documents
Embed our own corpus (about 30,000 news articles, each reduced to its extracted keywords) into the word2vec vector space:
# Embed each document as the average of its keywords' word vectors
from datetime import datetime
import numpy as np

start = datetime.now()
embedding = []
for idx, line in enumerate(keywords):
    vector = np.zeros(300)
    for word in line:
        if word in model.vocab:    # skip out-of-vocabulary words
            vector += model[word]
    embedding.append(vector / 20)  # 20 keywords per document
    if idx % 100 == 0:
        print(idx)
end = datetime.now()
print(end - start)
Since I extract 20 keywords per article, I divide the summed vector by 20, i.e. the mean of the word vectors serves as the document vector.
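Note that dividing by a fixed 20 slightly shrinks the vectors of documents with out-of-vocabulary keywords. A minimal alternative sketch that averages only over the in-vocabulary words (my variation, not what the run above used):

def doc_vector(words, model, dim=300):
    # Mean of the in-vocabulary word vectors; zero vector if none are found
    vecs = [model[w] for w in words if w in model.vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

embedding = [doc_vector(line, model) for line in keywords]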
3. Clustering with sklearn's KMeans
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, random_state=1).fit(embedding)
y_pred = kmeans.labels_
cluster_center = kmeans.cluster_centers_
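The choice of 100 clusters is somewhat arbitrary; a quick elbow-style sweep over kmeans.inertia_ is one common way to sanity-check k (a sketch, not part of the original run):

# Sweep a few values of k and watch where the inertia curve flattens
for k in (20, 50, 100, 150):
    km = KMeans(n_clusters=k, random_state=1).fit(embedding)
    print(k, km.inertia_)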
from collections import Counter

center_dict = Counter(y_pred)
center_dict
Check how many documents fall into each cluster:
Counter({16: 314,
27: 384,
21: 160,
30: 370,
99: 223,
15: 158,
36: 882,
48: 180,
14: 184,
43: 447,
98: 726,
88: 601,
52: 195,
53: 351,
13: 565,
5: 523,
22: 417,
23: 365,
71: 604,
37: 740,
63: 355,
29: 492,
25: 554,
82: 335,
50: 727,
41: 676,
47: 344,
4: 141,
70: 274,
12: 559,
78: 481,
84: 820,
40: 237,
75: 340,
3: 394,
10: 574,
56: 564,
59: 414,
51: 301,
73: 503,
6: 560,
60: 268,
86: 405,
2: 611,
28: 485,
66: 489,
76: 334,
77: 296,
33: 226,
65: 464,
97: 501,
18: 188,
7: 218,
54: 251,
35: 511,
92: 404,
19: 454,
74: 228,
67: 325,
49: 591,
24: 306,
69: 547,
72: 330,
11: 280,
95: 374,
81: 464,
58: 636,
32: 274,
79: 115,
87: 205,
62: 425,
34: 281,
38: 330,
96: 269,
64: 445,
68: 416,
9: 382,
91: 113,
80: 251,
20: 517,
44: 264,
93: 276,
26: 240,
17: 381,
55: 129,
57: 470,
0: 501,
83: 167,
8: 261,
89: 134,
85: 69,
31: 200,
90: 147,
46: 188,
94: 492,
1: 91,
42: 401,
45: 124,
61: 189,
39: 91})
The cluster sizes look fairly even. I pulled a random cluster and skimmed its article titles: some fit well and some don't, but the overall result is decent.
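For reference, a minimal sketch of how such a spot check might look, assuming a titles list aligned with keywords (the titles variable is hypothetical, not part of the original code):

cluster_id = 36                              # any cluster of interest
members = np.where(y_pred == cluster_id)[0]  # indices of its documents
for i in members[:10]:                       # skim the first ten titles
    print(titles[i])                         # `titles` is assumed here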
I had also planned to try DBSCAN, but it took far too long; making it feasible would require reducing the vector dimensionality first, so I gave up on it.
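For anyone who does want to try, a minimal sketch along those lines: compress the 300-dimensional embeddings with PCA before running DBSCAN. The n_components, eps, and min_samples values below are placeholders that would need tuning:

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

reduced = PCA(n_components=50).fit_transform(embedding)  # 300 -> 50 dims
db = DBSCAN(eps=0.5, min_samples=5).fit(reduced)         # placeholder params
print(Counter(db.labels_))                               # -1 marks noise points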
