Pretrained Word Vectors
(1) Convert the word column of the dataframe (text that has already been tokenized and preprocessed) into a list:
text = x.pop('word')
text = np.array(text)
text = text.tolist()
labels = x['label']
labels = np.array(labels)
labels = labels.tolist()
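As a small self-contained illustration of this step (the toy DataFrame and its contents are invented for the example; the column names word and label match the snippet above):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for x: 'word' holds pre-tokenized text, 'label' the class.
x = pd.DataFrame({
    'word': ['今天 天气 很 好', '我 不 喜欢 下雨'],
    'label': [1, 0],
})

text = x.pop('word')            # removes the column from x and returns it as a Series
text = np.array(text).tolist()  # -> plain Python list of strings
labels = np.array(x['label']).tolist()

print(text)    # ['今天 天气 很 好', '我 不 喜欢 下雨']
print(labels)  # [1, 0]
```

Note that pop removes the word column from x in place, so only label (and any other feature columns) remain afterward.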
(2) Serialize the text with Keras's Tokenizer:
# Text serialization (note: the argument is num_words in Keras 2; nb_words is the old Keras 1 name)
tokenizer = Tokenizer(num_words=MAX_WORDS_NUM, split=" ")
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LEN)
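To make the mechanics concrete, here is a minimal pure-Python sketch of what fit_on_texts, texts_to_sequences, and front-padding do. This is an illustration, not the Keras implementation; like Keras, it reserves index 0 for padding and ranks words by frequency.

```python
from collections import Counter

def fit_on_texts(texts, split=" "):
    # Rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding).
    counts = Counter(w for t in texts for w in t.split(split))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index, split=" "):
    # Replace each known word with its index; unknown words are dropped.
    return [[word_index[w] for w in t.split(split) if w in word_index] for t in texts]

def pad_sequences_pre(seqs, maxlen):
    # Keras pads and truncates at the front by default (padding='pre').
    return [([0] * (maxlen - len(s)) + s)[-maxlen:] for s in seqs]

texts = ['我 喜欢 天气 好', '天气 好 我 就 开心']
word_index = fit_on_texts(texts)
seqs = texts_to_sequences(texts, word_index)
data = pad_sequences_pre(seqs, maxlen=6)
```

After this, every row of data has the same length MAX_SEQUENCE_LEN, which is what a Keras Embedding layer expects as input.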
(3) Load pretrained word vectors, here the 300-dimensional Weibo vectors from https://github.com/Embedding/Chinese-Word-Vectors:
# Load the pretrained vectors into a dict: word -> 300-dimensional vector
def pre_trained():
    embeddings_index = {}
    with open('./source/pretrain/sgns.weibo.word.txt', encoding='utf-8') as f:
        next(f)  # skip the header line ("<vocab_size> <dimension>", word2vec text format)
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index
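The file is plain text in word2vec format: a header line, then one word per line followed by its vector components. The parsing logic can be checked against a tiny fake file (the filename, words, and values below are invented for the test):

```python
import numpy as np
import tempfile, os

def load_embeddings(path):
    # Same parsing as pre_trained(), with the path passed in and the header skipped.
    embeddings_index = {}
    with open(path, encoding='utf-8') as f:
        next(f)  # header: "<vocab_size> <dimension>"
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings_index

# Write a tiny fake embedding file (2 words, 3 dimensions) and parse it back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("2 3\n")
    tmp.write("天气 0.1 0.2 0.3\n")
    tmp.write("下雨 -0.1 0.0 0.5\n")
    path = tmp.name

emb = load_embeddings(path)
os.remove(path)
```

The resulting dict is typically used to fill an embedding matrix indexed by word_index, which is then passed to an Embedding layer as its initial weights.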
