GenSim: an NLP tool
GenSim is an open-source Python library for NLP modelling. API docs are available online.
(from official site:)
GenSim: topic modelling for humans.
Train large-scale semantic nlp models.
Represent text as semantic vectors.
Find semantically related documents.
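corpora.Dictionary
Methods:
- doc2bow(): represent a document as a bag-of-words (BOW) vector, i.e. a list of (token_id, count) pairs
- token2id (attribute): dict mapping each token to its integer id
- filter_tokens(bad_ids): remove the tokens with the given ids from the dictionary (it takes token ids, not token strings)
- compactify(): reassign ids to close the gaps left in the id sequence by removed tokens
corpora.Dictionary(*)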
Models
Word2Vec, word2vec model
gensim.models.Word2Vec
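- Word2Vec(sentences=): train a model on the given sentences
- wv[word]: <vector> (in Gensim 4.x, vectors are looked up through the model's wv attribute)
- wv.most_similar(positive=, topn=): list[tuple[str, float]] of (word, similarity) pairs
- save()
- Word2Vec.load(): class method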
gensim.models.KeyedVectors.load_word2vec_format(<filepath>, binary):
- <filepath>: path to the word2vec model file
- binary: True or False. True for a word2vec model file in binary format, False for text format.
Train
Custom preprocessing: write a custom class whose __iter__() yields one list[str] of tokens per sentence.
gensim.models.Word2Vec() returns a word2vec model instance trained on the given corpus. Parameters:
- sentences: iterable[list[str]]; it must be re-iterable, not a one-shot generator
- min_count: ignore words whose total frequency is lower than this
- vector_size: dimension of the vectors
- workers: int, number of worker threads used for training
- ...
KeyedVectors
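- load(fname_or_handle): load from a file saved in Gensim's native format
- load_word2vec_format(): load from a file in the original C word2vec format
- vector_size: dimension of the vectors
- ['word']: get the vector for a word
- key_to_index: get the vocabulary as a dict from word to index (Gensim 4.x; this was vocab in 3.x)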
from gensim.models import KeyedVectors
model = KeyedVectors.load('/path/to/w2v.model')       # load a file saved in Gensim's native format
model = KeyedVectors.load_word2vec_format('/path/to/w2v.model') # load word2vec text format (binary=False is the default)
model = KeyedVectors.load_word2vec_format('/path/to/w2v.bin', binary=True)   # load word2vec binary format
vec = model['word']     # query the vector for a word
dim = model.vector_size # get the dimension of the vectors
known = my_word in model.key_to_index   # test whether a word is in the vocabulary (OOV check), Gensim 4.x
# Before Gensim 4.0: my_word in model.vocab on KeyedVectors, or my_word in model.wv.vocab on a Word2Vec model
Creating a KeyedVectors instance from a Python dict:
import numpy as np
from gensim.models import KeyedVectors

d = dict()
d['my_word1'] = np.random.randn(300)
d['my_word2'] = np.random.randn(300)

word_list, vector_list = zip(*d.items())
# Gensim 4.x: vector_size must match the dimension of the vectors being added.
m = KeyedVectors(vector_size=300)
# add_vectors registers the words and their vectors in one call. It replaces the
# Gensim 3.x approach of filling m.vocab with Vocab entries and round-tripping
# through save_word2vec_format / load_word2vec_format with a temporary file.
m.add_vectors(list(word_list), np.array(vector_list))
m.most_similar('my_word1')
TfidfModel
models.TfidfModel(), parameters:
- corpus
- normalize
LsiModel, Latent Semantic Indexing (or Latent Semantic Analysis, LSA)
RpModel, Random Projections
LdaModel, Latent Dirichlet Allocation, LDA
HdpModel, Hierarchical Dirichlet Process, HDP
Doc2Vec, doc2vec (paragraph to vector) model
gensim.models.doc2vec.Doc2Vec
FastText, fasttext model
gensim.models.fasttext.FastText
gensim.models.LdaModel, LDA model
Corpora Formats
- MmCorpus: Matrix Market format
- SvmLightCorpus: Joachims' SVMlight format
- BleiCorpus: Blei's LDA-C format
- LowCorpus: GibbsLDA++ format
Install
pip install gensim
# or using conda
conda install gensim 
Bugs
Generators are not supported as a corpus
from gensim.models import Word2Vec
Word2Vec(create_generator(), vector_size=SIZE)    # ERROR, a generator cannot be consumed for 2 or more passes
#------
w2v = Word2Vec(vector_size=SIZE)
w2v.train(create_generator())           # ERROR, must build vocabulary first
In version 4.0.1, passing a generator to Word2Vec as sentences raises a TypeError from Word2Vec._check_corpus_sanity(). The message is TypeError: Using a generator as corpus_iterable can't support 6 passes. Try a re-iterable sequence. It comes from the check if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:.
A generator can be consumed only once, so rejecting it for two or more passes makes sense. The 6 is the default epochs=5 plus 1, the extra pass that builds the vocabulary before training: self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=(epochs + 1)), code location: word2vec.py:417 (Gensim 4.0.1). Setting epochs to 0 does not get around this, because ._check_training_sanity() then raises an error.
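The difference between a generator and a re-iterable sequence can be shown without Gensim at all. In this plain-Python sketch, sentence_generator and SentenceCorpus are hypothetical names; wrapping the generator function in a class with __iter__ is one common way to get an object that supports many passes.

```python
import types


def sentence_generator():
    """Hypothetical corpus source; the generator it returns can be consumed once."""
    yield ["hello", "world"]
    yield ["good", "morning"]


class SentenceCorpus:
    """Re-iterable wrapper: __iter__ builds a fresh generator on every pass."""

    def __iter__(self):
        return sentence_generator()


gen = sentence_generator()
gen_pass_1 = list(gen)   # first pass sees both sentences
gen_pass_2 = list(gen)   # second pass sees nothing: the generator is exhausted

corpus = SentenceCorpus()
passes = [list(corpus) for _ in range(6)]   # 6 passes, as with epochs=5 plus vocabulary building

# The wrapper is not a GeneratorType instance, so an isinstance check
# against GeneratorType does not reject it.
is_generator = isinstance(corpus, types.GeneratorType)
```

An object like corpus can be passed where the notes above pass create_generator(), since each pass restarts from the beginning.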