Embedding layers in RNNs - in practice

To see how embeddings work, let's look at a dataset for predicting airline customer sentiment from customer tweets:

1. First, import the relevant packages and read in the dataset:

#import relevant packages
from keras.layers import Dense, Activation
from keras.layers.recurrent import SimpleRNN
from keras.models import Sequential
from keras.utils import to_categorical
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split  #cross_validation was removed from newer scikit-learn
import numpy as np
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd

#Let us go ahead and read the dataset:
t = pd.read_csv('/home/akishore/airline_sentiment.csv')
t.head()

#Binarize the label: positive -> 1, negative/neutral -> 0
t['sentiment'] = np.where(t['airline_sentiment']=="positive", 1, 0)
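Note that np.where lumps negative and neutral tweets together as the 0 class; a tiny illustration on hypothetical label values:

#Hypothetical labels, for illustration only
demo = pd.Series(["positive", "negative", "neutral"])
print(np.where(demo=="positive", 1, 0))
#[1 0 0]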

2. The text is noisy, so we preprocess it by stripping punctuation and converting all words to lowercase:

def preprocess(text):
    text = text.lower()
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)  #replace non-alphanumerics with spaces
    words = text.split()
    #words2 = [w for w in words if (w not in stop)]  #optional stop-word removal
    #words3 = [ps.stem(w) for w in words]            #optional stemming
    words4 = ' '.join(words)
    return(words4)

t['text'] = t['text'].apply(preprocess)
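For instance, applying the function to a hypothetical tweet (made up here for illustration) lowercases it and strips the handle punctuation, which is also why virginamerica appears as a plain token in the vocabulary built below:

print(preprocess("@VirginAmerica Great flight!!"))
#virginamerica great flight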

3. As in the previous section, we convert the words to index values. Note that enumeration starts at 1, which leaves index 0 free for the padding used in step 5:

from collections import Counter

counts = Counter()
for i, review in enumerate(t['text']):
    counts.update(review.split())

words = sorted(counts, key=counts.get, reverse=True)
words[:10]

chars = words
nb_chars = len(words)
word_to_int = {word: i for i, word in enumerate(words, 1)}
int_to_word = {i: word for i, word in enumerate(words, 1)}

word_to_int['the']
#3
int_to_word[3]
#the

4. Map each review's words to their corresponding indices:

mapped_reviews = []
for review in t['text']:
    mapped_reviews.append([word_to_int[word] for word in review.split()])

t.loc[0:1]['text']
mapped_reviews[0:2]

Note that the index of virginamerica is the same (104) in both reviews; you can confirm this with word_to_int['virginamerica'].

5. Initialize sequences of zeros of length 200. Note that we chose 200 as the sequence length because no review has more than 200 words. Additionally, the second part of the code below caps every review at 200 words, so all sequences end up the same length: the leading positions are filled with zeros and the final positions correspond to the words actually present in the review (a toy check of this pre-padding follows the code):

sequence_length = 200
sequences = np.zeros((len(mapped_reviews), sequence_length), dtype=int)
for i, row in enumerate(mapped_reviews):
    review_arr = np.array(row)
    sequences[i, -len(row):] = review_arr[-sequence_length:]
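As a quick sanity check of the pre-padding logic, here is a minimal sketch on a hypothetical three-word review with a sequence length of 5 (the index values are made up):

toy = np.zeros((1, 5), dtype=int)
row = [7, 2, 9]  #hypothetical word indices
toy[0, -len(row):] = np.array(row)[-5:]
print(toy)
#[[0 0 7 2 9]] -> zeros in front, word indices at the end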

6. Split the dataset into train and test sets:

y = t['sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(sequences, y, test_size=0.30, random_state=10)
y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)
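to_categorical one-hot encodes the 0/1 labels into two columns, matching the two-unit softmax output layer defined in the next step:

print(to_categorical([0, 1, 1]))
#[[1. 0.]
# [0. 1.]
# [0. 1.]]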

7. Now that the datasets are in place, we build the model. Note that the Embedding layer takes as input the total number of unique words, the reduced number of dimensions in which each word is represented, and the number of words in each input sequence:

top_words = 12679  #number of unique words (strictly, indices run from 1 to 12679, so nb_chars + 1 would be safer here)
embedding_vector_length = 32
max_review_length = 200

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(SimpleRNN(1, return_sequences=False, unroll=True))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train2, validation_data=(X_test, y_test2), epochs=50, batch_size=1024)
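Once training finishes, the model can score new text. The helper below is a minimal sketch (the encode function and the sample tweet are made up here for illustration; it assumes the tweet contains at least one word seen during training):

def encode(text, seq_len=max_review_length):
    #reuse the same preprocessing and word-to-index mapping as above
    idxs = [word_to_int[w] for w in preprocess(text).split() if w in word_to_int]
    seq = np.zeros((1, seq_len), dtype=int)
    seq[0, -len(idxs):] = np.array(idxs)[-seq_len:]
    return seq

probs = model.predict(encode("the flight was great"))
print(probs)  #[[P(class 0), P(class 1)]]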

Let's look at the model summary. There are 12,679 unique words in the dataset, and the embedding layer represents each of them in a 32-dimensional space, so the embedding layer has 405,728 (12,679 x 32) parameters.

Each 32-dimensional embedded input is connected to the single hidden unit, giving 32 weights, plus 1 bias. The final weight of this layer is the recurrent weight connecting the previous hidden state to the current hidden unit, for a total of 34 parameters. Note that when the input comes from an embedding layer, we do not need to specify an input shape in the SimpleRNN layer. Once the model runs, the output classification accuracy is about 87%.
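These counts can be verified directly against model.summary():

print(12679 * 32)        #405728 parameters in the Embedding layer
print(32*1 + 1*1 + 1)    #34 parameters in SimpleRNN(1): input weights + recurrent weight + bias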
