Gensim中文教程——TF-IDF

转换接口

在之前已经创建了文档语料库(dictionary和corpus). 

为了揭示语料库中的隐藏结构,发现词之间的关系,并使用它们以新的,更语义的方式描述文档。使文档表示更加紧凑, 这既提高效率(表示消耗较少的资源)和效率(忽略边际数据趋势,降低噪声)。

1.创建一个转换

  转换是标准的Python对象,通常通过训练语料库初始化。

tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2.转换为垂直变量

doc_bow = [(0, 1), (1, 1)]
>>> print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)]

也可以应用于整个语料库

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
...     print(doc)
[(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]
[(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]
[(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.70710678118654746), (10, 0.70710678118654746)]
[(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]
[(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]
View Code

在这种特殊情况下,我们正在转换我们用于训练的同一个语料库,但这只是偶然的。一旦变换模型已经被初始化,它可以用于任何向量(当然它们来自相同的向量空间),即使它们在训练语料库中没有使用。 这是通过称为折叠LSA的过程,通过LDA等的主题推断来实现的.

3.转换也可以串行化,一个在另一个之上,在一个链中:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

打印出主题:

lsi.print_topics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"

打印出在每篇文档中每个主题的比例

>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"
View Code

将结果保存和加载

>>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')

 

posted on 2017-02-26 19:33  水滴石穿—敏  阅读(1368)  评论(0)    收藏  举报

导航