GenSim: an NLP tool
GenSim is an open-source Python library for NLP modelling. Its API documentation is available online.
From the official site:
GenSim: topic modelling for humans.
Train large-scale semantic NLP models.
Represent text as semantic vectors.
Find semantically related documents.
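corpora.Dictionary
Methods and attributes:
doc2bow(document): represent a tokenized document as a bag-of-words (BOW) vector, a list of (token_id, count) pairs
token2id (attribute): dict mapping each token to its int id
filter_tokens(bad_ids=None, good_ids=None): remove tokens from the dictionary by id
compactify(): remove gaps in the id sequence left by words that were removed
corpora.Dictionary(*)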
Models
Word2Vec, word2vec model
gensim.models.Word2Vec
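Word2Vec(sentences=): train a model from a corpus
model.wv[word]: <vector> (model[word] in Gensim versions before 4.0)
model.wv.most_similar(positive=, topn=): list of (word, similarity) pairs
save(): save the model to a file
Word2Vec.load(): class method that loads a saved model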
gensim.models.KeyedVectors.load_word2vec_format(<filepath>, binary):
<filepath>: path to the word2vec model file
binary: True or False. True for a word2vec model file in binary format, False for text format.
Train
Custom preprocessing: write a custom class whose __iter__() yields one list[str] of tokens per sentence.
gensim.models.Word2Vec() returns a word2vec model instance trained on the given corpus. Parameters:
sentences: iterable of list[str]
min_count: ignore words whose total frequency is lower than this
vector_size: dimension of the vectors
workers: int, training parallelism
...
KeyedVectors
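load(fname_or_handle): load from a file in gensim's own format
load_word2vec_format(): load from a file in the original C word2vec format
vector_size: dimension of the vectors
['word']: get the vector for a word
vocab: get the vocabulary (replaced by key_to_index in Gensim 4.x)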
from gensim.models import KeyedVectors
model = KeyedVectors.load('/path/to/w2v.model')  # load from a file in gensim format
model = KeyedVectors.load_word2vec_format('/path/to/w2v.model')  # load from a word2vec text-format file
model = KeyedVectors.load_word2vec_format('/path/to/w2v.bin', binary=True)  # load from a binary-format file
vec = model['word']  # query the vector for a word
dim = model.vector_size  # get the dimension of the vectors
my_word = 'word'
in_vocab = my_word in model.key_to_index  # Gensim 4.x: test whether the word exists (False means OOV)
# in_vocab = my_word in model.vocab  # KeyedVectors before Gensim 4.0
# in_vocab = my_word in model.wv.vocab  # Word2Vec model in gensim 2.x
Creating a KeyedVectors instance of word2vec vectors from a Python dict:
import tempfile
import numpy as np
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Vocab  # Gensim 3.x API; the vocab attribute was removed in Gensim 4.0

d = dict()
d['my_word1'] = np.random.randn(300)
d['my_word2'] = np.random.randn(300)

word_list, vector_list = zip(*d.items())
m = KeyedVectors(vector_size=300)  # match the dimension of the vectors above
m.vocab = dict()
m.vectors = np.array(vector_list)
for i in range(len(vector_list)):
    m.vocab[word_list[i]] = Vocab(index=i, count=1)

# round-trip through a temporary word2vec file so that gensim rebuilds its internal indexes
with tempfile.NamedTemporaryFile(delete=False) as f:
    tempfilew2v = f.name  # only reserve a file name; gensim writes to the path itself
m.save_word2vec_format(tempfilew2v, binary=True)
m = KeyedVectors.load_word2vec_format(tempfilew2v, binary=True)
m.most_similar('my_word1')
# remove temp file
# import os
# os.remove(tempfilew2v)
TfidfModel
models.TfidfModel(), parameters:
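corpus: iterable of bag-of-words documents used to compute the IDF statistics
normalize: whether to normalize the resulting vectors to unit length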
LsiModel, Latent Semantic Indexing (or Latent Semantic Analysis, LSA)
RpModel, Random Projections
LdaModel, Latent Dirichlet Allocation, LDA
HdpModel, Hierarchical Dirichlet Process, HDP
Doc2Vec, doc2vec (paragraph to vector) model
gensim.models.doc2vec.Doc2Vec
FastText, fasttext model
gensim.models.fasttext.FastText
gensim.models.LdaModel, LDA model
Corpora Formats
MmCorpus: Matrix Market format
SvmLightCorpus: Joachim's SVMlight format
BleiCorpus: Blei's LDA-C format
LowCorpus: GibbsLDA++ format
Install
pip install gensim
# or using conda
conda install gensim
Bugs
Generators are not supported as the corpus
from gensim.models import Word2Vec
Word2Vec(create_generator(), vector_size=SIZE) # ERROR, a generator cannot be consumed for 2 or more passes
#------
w2v = Word2Vec(vector_size=SIZE)
w2v.train(create_generator()) # ERROR, must build vocabulary first
In version 4.0.1, passing a generator to Word2Vec as sentences raises a TypeError from Word2Vec._check_corpus_sanity(). The error says TypeError: Using a generator as corpus_iterable can't support 6 passes. Try a re-iterable sequence. It comes from the check if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:. A generator cannot be read for 2 or more passes, so the check makes sense. The 6 here is the default epochs=5 plus 1, the extra pass that builds the vocabulary before training: self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=(epochs + 1)), code location: word2vec.py:417 (Gensim 4.0.1). Setting epochs to 0 does not help, because ._check_training_sanity() then raises an error.
