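word2vec experiments on Chinese Wikipedia, Python and Java versions

I have been picking my NLP work back up recently. I was about to start on relation extraction, and while looking for a way to turn words into vectors I came across Word2Vec, which is where the tinkering began.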

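1. Java version

Word2Vec has many implementations. This experiment mainly uses the Python one, but to save effort (and stay inside my current project) I tried the Java version first. It was written by Sun Jian, the author of ansj; if I remember correctly, ansj itself is no longer maintained. Still, it was worth a try, and using it is simple: import the project, train, then query. Without a corpus, though, there was little to show for it.

Address: https://github.com/NLPchina/Word2VEC_java

For reasons I have not worked out, once the corpus gets larger (1 GB of Chinese text, which is not even that big), the Java version dies when memory use reaches 4.17 GB, even though I had given it 10 GB to be safe. So the training part of the Java version did not run through on a large corpus; I will try again later.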

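2. Python version

For lack of a large corpus I went searching again: the State Language Commission corpus, the corpora bundled inside various word segmentation tools, the Sogou corpus, the Peking University Chinese corpus, and so on. Either they could not be downloaded or they were too old. Then, while browsing 52nlp, I found a write-up showing that Chinese Wikipedia is available as a high-quality corpus, and grabbed it right away. <The procedure follows: http://www.52nlp.cn/中英文维基百科语料上的word2vec实验>

Environment: MacBook Pro, i5, 16 GB RAM, 256 GB SSD; Python 2.7, JDK 1.8

Steps:

1. Download the corpus. Only the Chinese dump is needed for now: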

https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
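
The dump is a large file, so it may be worth streaming it to disk rather than reading it into memory. The following is a minimal sketch of that idea, written for Python 3 (the scripts in this post target Python 2.7). The function name `download` and the chunk size are assumptions for illustration; only the URL comes from the post.

```python
import shutil
import urllib.request

# URL taken from the post; everything else here is illustrative.
DUMP_URL = "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1024 * 1024):
    """Stream `url` to `dest` in fixed-size chunks (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so the whole dump is never held in memory at once
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Usage (not executed here because it fetches the full dump):
# download(DUMP_URL, "zhwiki-latest-pages-articles.xml.bz2")
```

Any downloader (a browser, wget, curl) works just as well; the only requirement is that the .bz2 file ends up next to process_wiki.py.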

2. Parse the wiki

process_wiki.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Extract plain text from a Wikipedia dump, one article per line.

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage line
        print("Usage: python %s <wiki_dump.xml.bz2> <output_text>" % program)
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # dictionary={} skips the vocabulary-building pass; only the text is needed
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

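Put the two files (the script and the dump) in the same directory and run: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text. The output looks like this (I did not capture my own run, so this log is borrowed):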

2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles

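After parsing, three more steps are needed: (1) convert Traditional Chinese to Simplified Chinese, (2) unify the encoding to UTF-8, (3) segment the text into words.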

I already had tools at hand for all three, so I did not use the ones from the 52nlp article; anything that achieves the same result will do.
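
Since the post does not say which tools were used, here is a toy sketch of what the three steps do to one line of text, written for Python 3. Everything in it is an assumption for illustration: the three-entry conversion table `T2S`, the three-word vocabulary `VOCAB`, and forward maximum matching as the segmentation method. A real run needs a complete conversion table and a proper segmenter.

```python
# Hypothetical three-entry Traditional -> Simplified table (a real one has thousands of entries).
T2S = {"國": "国", "學": "学", "數": "数"}
# Hypothetical toy vocabulary for the segmenter.
VOCAB = {"中国", "数学", "数学家"}


def to_simplified(text):
    """Map each character through the table, leaving unknown characters as-is."""
    return "".join(T2S.get(ch, ch) for ch in text)


def segment(text, vocab=VOCAB, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary word."""
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += n
                break
    return words


def preprocess_line(line):
    """Convert one line of space-separated tokens and re-segment the non-ASCII ones."""
    out = []
    for token in to_simplified(line).split():
        if all(ord(ch) < 128 for ch in token):
            out.append(token)  # leave ASCII tokens (English words, numbers) untouched
        else:
            out.extend(segment(token))
    return " ".join(out)


def preprocess_file(src, dst, src_encoding="utf-8"):
    """Read src in its own encoding and always write UTF-8."""
    with open(src, encoding=src_encoding) as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(preprocess_line(line) + "\n")
```

The output keeps the one-article-per-line layout with space-separated words, which is what the training script below reads.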

Next comes train_word2vec_model.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train a Word2Vec model on a pre-segmented text file (one document per line).

import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        # the script has no module docstring, so print an explicit usage line
        print("Usage: python %s <segmented_text> <model_out> <vector_out>" % program)
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, context window of 5, ignore words seen fewer
    # than 5 times, one worker thread per CPU core
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use(much) less RAM
    #model.init_sims(replace=True)
    model.save(outp1)                                 # full model, reloadable with Word2Vec.load
    model.save_word2vec_format(outp2, binary=False)   # plain-text vectors

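Run: python train_word2vec_model.py wiki.zh.text wiki.zh.text.model wiki.zh.text.vector (the first argument should be the converted, segmented text from the previous step; in the log below that file is named wiki.zh.text.jian.seg.utf-8)

As above, the output looks like this: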

2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,592: INFO: collecting all words and their counts
2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
...
2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5
2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words
2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25
2015-03-11 18:52:29,683: INFO: resetting layer weights
2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
......
2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm
2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
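
The log above reports 1575172 word types collected and 278291 left after removing those with count<5, which is the effect of min_count=5 in the training script. The sketch below illustrates that counting and pruning step with the standard library only (Python 3). It is a stand-in for illustration, not gensim's implementation; `iter_sentences` and `build_vocab` are hypothetical names.

```python
from collections import Counter


def iter_sentences(path):
    """Yield one token list per non-empty line of a space-separated text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


def build_vocab(sentences, min_count=5):
    """Count every word, then keep only the words seen at least min_count times."""
    counts = Counter()
    total_words = 0
    for tokens in sentences:
        counts.update(tokens)
        total_words += len(tokens)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept, total_words


# Tiny in-memory corpus standing in for the segmented wiki text.
toy = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
counts, kept, total_words = build_vocab(toy, min_count=2)
print(len(counts), "word types,", len(kept), "kept, from", total_words, "words")
```

Rare words contribute most of the word types but few of the tokens, which is why pruning shrinks the vocabulary so sharply in the log while most of the corpus is still used for training.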

Once training finishes, the model can be used from Python.

Basic usage:

import gensim
model = gensim.models.Word2Vec.load("wiki.zh.text.model")

>>> result = model.most_similar(u"美女")
>>> for e in result:
...     print e[0],e[1]
... 
帅哥 0.629959464073
正妹 0.607636809349
校花 0.566570997238
美腿 0.560691952705
女明星 0.556897878647
性感 0.548311054707
谐星 0.537560880184
大变身 0.52529746294
女丑 0.517377853394
辣妹 0.506102442741

>>> result = model.most_similar(positive=[u'中国',u'日本'], negative=[u'东京'])
>>> for e in result:
...     print e[0],e[1]
... 
我国 0.525859713554
中国政府 0.455589711666
朝鲜民主主义人民共和国 0.433199852705
中华民国 0.430634796619
全中国 0.429285645485
美国 0.425486922264
境外 0.422223210335
台商 0.420866370201
英国 0.420089453459
中华人民共和国政府 0.41133800149

>>> result = model.most_similar(positive=[u'女人',u'国王'], negative=[u'男人'])
>>> for e in result:
...     print e[0],e[1]
... 
王储 0.538514256477
王室 0.533518970013
四世 0.531962811947
一世 0.531662106514
王后 0.528761506081
王位 0.517430365086
君主 0.513949334621
摄政王 0.50737452507
二世 0.503388166428
六世 0.503049015999
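
The scores printed by the queries above are cosine similarities. With positive and negative words, the query vector is built from the unit-length vectors of the inputs (added for positive, subtracted for negative), and the input words themselves are left out of the ranking. Here is a small pure-Python sketch of that ranking (Python 3). The three-dimensional `TOY_VECTORS` are made up for illustration and have nothing to do with the trained model, so the ranking they give is cleaner than the real one shown above.

```python
import math

# Made-up vectors; the dimensions loosely mean (male, female, royal).
TOY_VECTORS = {
    "男人": [1.0, 0.0, 0.0],
    "女人": [0.0, 1.0, 0.0],
    "国王": [1.0, 0.0, 1.0],
    "王后": [0.0, 1.0, 1.0],
    "苹果": [0.1, 0.1, -1.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=10):
    """Rank the remaining words by cosine similarity to the combined query vector."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


for word, score in most_similar(TOY_VECTORS, positive=["女人", "国王"], negative=["男人"]):
    print(word, score)
```

On the real model the neighbours are noisier: in the output above, 王后 appears fifth for the same query, behind 王储 and 王室.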

>>> model[u'帅哥']
array([ -5.31498909e-01,  -1.10617805e+00,   1.02419519e+00,
        -3.50866057e-02,   5.56856513e-01,   6.14050031e-01,
         1.03647232e-01,   6.10242724e-01,   2.12321617e-02,
        -5.38967609e-01,  -7.74732232e-01,   2.75299311e-01,
        -4.18679267e-01,   2.29567051e-01,   2.23700061e-01,
        -5.36157131e-01,   6.64938211e-01,  -4.05853897e-01,
         5.77953935e-01,  -4.21773642e-01,  -8.07677925e-01,
        -1.39366493e-01,  -2.69933283e-01,   5.06161451e-01,
         4.67247456e-01,   1.66101696e-03,   7.38345563e-01,
        -6.92869484e-01,   3.19320440e-01,   9.45071697e-01,
        -2.35498585e-02,  -5.21626115e-01,   1.13025808e+00,
        -1.67293274e+00,  -2.24904671e-01,  -8.13860118e-01,
        -4.53192621e-01,  -2.13154644e-01,   4.65950929e-02,
         1.29193068e-01,  -6.40475228e-02,  -1.21741116e+00,
         1.86280087e-01,   8.68674144e-02,  -1.09420717e+00,
         8.19482096e-03,  -7.45698586e-02,  -1.16133177e+00,
         7.06594527e-01,   7.71784961e-01,  -7.01051205e-02,
         6.90828502e-01,  -1.52761474e-01,  -5.61881602e-01,
...................................................
         2.23608285e-01,  -8.73272657e-01,   7.49607459e-02,
         1.51212966e+00,  -7.33180463e-01,  -6.13278568e-01,
         1.78863153e-01,   1.22361040e+00,  -1.30831683e+00,
        -3.13518018e-01], dtype=float32)
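
The same vectors are also stored in wiki.zh.text.vector, the plain-text file written by save_word2vec_format(outp2, binary=False). That format has a header line with the vocabulary size and the dimension (278291 and 400 in the log above), followed by one line per word: the word, then its components, separated by spaces. The sketch below reads such a file without gensim (Python 3); `load_text_vectors` is a hypothetical helper name, and it assumes the file is UTF-8.

```python
def load_text_vectors(path, limit=None):
    """Read a word2vec text-format file into a {word: [floats]} dict.

    `limit` (an illustrative parameter) stops after that many words,
    which is handy for peeking at a large file.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # header: "<vocabulary size> <dimension>"
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError("expected %d values for %r, got %d" % (dim, word, len(values)))
            vectors[word] = values
            if limit is not None and len(vectors) >= limit:
                break
    return vocab_size, dim, vectors


# Usage, assuming the file produced by the training step exists:
# vocab_size, dim, vectors = load_text_vectors("wiki.zh.text.vector", limit=1000)
# print(dim, vectors[u"帅哥"][:3])
```

The text format is convenient for inspection and for loading the vectors from other languages, at the cost of a larger file than the binary model.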

For more detailed usage, see:

https://radimrehurek.com/gensim/models/word2vec.html

http://rare-technologies.com/word2vec-tutorial/

Thanks to 52nlp.

posted on 2016-03-15 19:59 helloever
