Pretrained Word Vectors

(1) Convert the dataframe's `word` column (tokenized and preprocessed text) into a list:

text = x.pop('word')       # pop removes the column and returns a Series
text = text.tolist()       # Series.tolist() gives a plain Python list directly
labels = x['label']
labels = labels.tolist()
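As a quick sanity check, the conversion above can be exercised on a toy dataframe (the column names `word` and `label` follow the post; the sample rows here are made up):

```python
import pandas as pd

# Toy dataframe standing in for the preprocessed data (made-up rows)
x = pd.DataFrame({
    'word': ['今天 天气 很 好', '我 不 喜欢 下雨'],
    'label': [1, 0],
})

text = x.pop('word').tolist()    # pop drops the column from x and returns it
labels = x['label'].tolist()

print(text)    # ['今天 天气 很 好', '我 不 喜欢 下雨']
print(labels)  # [1, 0]
```

After `pop`, the dataframe `x` no longer contains the `word` column, which is why the label column is still read from `x` afterwards.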

(2) Serialize the text into integer sequences with Keras's Tokenizer:

# Text serialization
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Note: the old nb_words argument was renamed to num_words in Keras 2
tokenizer = Tokenizer(num_words=MAX_WORDS_NUM, split=" ")
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LEN)
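To make the padding step concrete without needing Keras installed, its default behavior (pad with 0 on the left, keep the last `maxlen` items when a sequence is too long) can be sketched in plain NumPy. This is only an illustration of the semantics, not the library implementation:

```python
import numpy as np

def pad_pre(sequences, maxlen):
    # Mimics keras pad_sequences defaults: padding='pre', truncating='pre'.
    # Index 0 is reserved by Tokenizer, so 0 is a safe padding value.
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for i, seq in enumerate(sequences):
        trunc = seq[-maxlen:]                 # keep the tail if too long
        out[i, maxlen - len(trunc):] = trunc  # right-align, zero-pad the front
    return out

print(pad_pre([[1, 2], [3, 4, 5, 6]], maxlen=3))
# [[0 1 2]
#  [4 5 6]]
```

The result is a dense `(num_texts, MAX_SEQUENCE_LEN)` integer matrix, which is the shape an Embedding layer expects.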

(3) Load the pretrained word vectors, using the pretrained 300-dimensional Weibo vectors from https://github.com/Embedding/Chinese-Word-Vectors:

# Load the pretrained word vectors into a {word: vector} dict
def pre_trained():
    embeddings_index = {}
    with open('./source/pretrain/sgns.weibo.word.txt', encoding='utf-8') as f:
        next(f)  # skip the "vocab_size dim" header line of the word2vec text format
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index
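The post stops at loading `embeddings_index`. To actually use it, the dictionary is typically turned into a weight matrix aligned with `word_index` (this follows the common Keras recipe rather than code from the post; `EMBEDDING_DIM = 300` matches the Weibo vectors):

```python
import numpy as np

EMBEDDING_DIM = 300  # sgns.weibo.word vectors are 300-dimensional

def build_embedding_matrix(word_index, embeddings_index, max_words, dim=EMBEDDING_DIM):
    # Row i holds the pretrained vector for the word whose Tokenizer index is i.
    # Words missing from the pretrained file (and row 0) stay all-zero.
    num_words = min(max_words, len(word_index) + 1)
    matrix = np.zeros((num_words, dim), dtype='float32')
    for word, i in word_index.items():
        if i >= num_words:
            continue
        vec = embeddings_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix
```

The resulting matrix would then be passed to a Keras layer such as `Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LEN, trainable=False)` so the pretrained vectors are used as fixed weights.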

posted @ 2020-05-07 18:21  青晨forever