Keras mini-project (1): classification with a CNN

 

To make the code easy to follow, I will show the result of each step as it runs, so that readers get an intuitive picture of what every piece of code does. Below I first walk through the steps using only a few records as examples; at the end of the article I run the same pipeline on a larger dataset and report the results for reference.

1. First, my data are stored in two Excel files: one holds the positive (pos) reviews and the other the negative (neg) reviews, named poss.xlsx and negg.xlsx respectively. Their contents are shown below:

The contents of poss.xlsx are:

The contents of negg.xlsx are:

2. Next, read in the data. The code is as follows:

import numpy as np
import pandas as pd

pos = pd.read_excel('poss.xlsx', header=None)  # read each Excel file into a pandas DataFrame
pos['label'] = 1                               # label column: 1 for positive reviews
neg = pd.read_excel('negg.xlsx', header=None)
neg['label'] = 0                               # label column: 0 for negative reviews
# note: `all` shadows the Python built-in, and DataFrame.append was removed in pandas 2.x,
# where pd.concat([pos, neg], ignore_index=True) is the equivalent
all = pos.append(neg, ignore_index=True)       # merge the two corpora into one DataFrame

 

Running print(all) gives output like this:
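The original screenshot of that output is not reproduced here; roughly speaking, the merged DataFrame has the raw review text in column 0 and the label in the 'label' column, which can be checked with something like:

print(all.shape)    # (number of reviews, 2): column 0 is the review text, 'label' is 0/1
print(all.head())   # first rows come from poss.xlsx, so label == 1
print(all.tail())   # last rows come from negg.xlsx, so label == 0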

Next comes word segmentation.

import jieba  # Chinese word segmentation

cw = lambda s: list(jieba.cut(s))   # tokenizer: jieba.cut returns a generator, list() materializes it
all['words'] = all[0].apply(cw)     # new column: the list of tokens for every review

Running print(all['words']) gives output like this:
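Again the screenshot is omitted; each entry of all['words'] is simply the list of tokens that jieba produced for that review. A quick hypothetical illustration (the sentence is invented, and the exact token boundaries depend on jieba's dictionary):

import jieba
print(list(jieba.cut('这个产品质量很好')))
# prints a list of word tokens, roughly ['这个', '产品', '质量', '很', '好']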

 

Collect all the words into one big vocabulary.

 

content = []
for i in all['words']:
    content.extend(i)                        # flatten all the token lists into one long list
abc = pd.Series(content).value_counts()     # Series: word -> frequency, sorted descending
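At this point abc is a pandas Series indexed by word, whose values are the word frequencies, sorted from most to least frequent. A quick check (the printed values naturally depend on the corpus):

print(len(abc))    # vocabulary size (number of distinct tokens)
print(abc.head())  # the most frequent tokens and their counts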

 

Give every word a fixed integer index.

 

abc[:] = range(1, len(abc) + 1)   # replace the counts with ids 1..N (most frequent word gets 1)
abc[''] = 0                       # the empty string maps to 0 and is used for padding
maxlen = 10                       # pad / truncate every review to 10 tokens (for this small demo)

def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]              # drop words outside the vocabulary
    s = s[:maxlen] + [''] * max(0, maxlen - len(s))   # truncate, then pad with '' up to maxlen
    return list(abc[s])                               # map words to their integer ids

all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))

The result looks like this:
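Since the screenshot is missing, here is a hypothetical example of what doc2num does to a short token list (the ids shown are placeholders; the real values come from abc):

# hypothetical token list -- the actual ids depend on the corpus
print(doc2num(['这个', '产品', '很', '好'], maxlen))
# -> exactly maxlen integers, e.g. [17, 52, 4, 9, 0, 0, 0, 0, 0, 0]
#    known words are mapped to their ids, then the sequence is padded with 0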

Shuffle the data and build the input arrays for Keras.

 

idx = list(range(len(all)))          # list() so np.random.shuffle also works under Python 3
np.random.shuffle(idx)
all = all.loc[idx]                   # reorder the rows randomly
x = np.array(list(all['doc2num']))   # shape (n_samples, maxlen), integer word ids
y = np.array(list(all['label']))
y = y.reshape((-1, 1))               # column vector of 0/1 labels

 

First, let's look at the form of the data in x, shown below:
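The screenshot is not included, but the only thing that matters for Keras is the shape of the arrays, which can be verified with:

print(x.shape)   # (n_samples, maxlen): each row is one padded sequence of word ids
print(y.shape)   # (n_samples, 1): 0 = negative, 1 = positive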

Next, build the convolutional neural network model with Keras.

# the hyper-parameters and the X_train / X_test split used here are defined in the full script below
model = Sequential()
# map each word id to a dense vector of length embedding_vecor_length; input is a sequence of maxlen ids
model.add(Embedding(len(abc), embedding_vecor_length, input_length=maxlen))
# 1-D convolution over the word dimension (Keras 1.x argument names)
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu'))
# keep only the strongest response of each filter across the whole sequence
model.add(GlobalMaxPooling1D())

# fully connected layer with dropout
model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# single sigmoid unit: probability that the review is positive
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
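Note that this snippet uses Keras 1.x argument names (nb_filter, filter_length, border_mode, nb_epoch). On Keras 2 the same architecture is written with the renamed arguments; a sketch of the two calls that change, assuming everything else stays the same:

from keras.layers import Conv1D  # Keras 2 name for Convolution1D

# replaces the Convolution1D(...) call above
model.add(Conv1D(filters=nb_filter,          # was nb_filter=
                 kernel_size=filter_length,  # was filter_length=
                 padding='valid',            # was border_mode=
                 activation='relu'))

# replaces the model.fit(...) call above
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=nb_epoch,                   # was nb_epoch=
          validation_data=(X_test, y_test))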

Finally, here is the complete sentiment-classification script for 1000 positive and 1000 negative reviews:

from __future__ import print_function
import jieba
import pandas as pd

import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D
embedding_vecor_length = 32   # dimension of each word-embedding vector
maxlen = 200                  # pad / truncate every review to 200 tokens
min_count = 5                 # intended minimum word frequency (defined but not applied below)
batch_size = 32
nb_epoch = 10                 # training epochs
nb_filter = 128               # number of convolution filters
filter_length = 3             # convolution window size, in words
pos = pd.read_excel('poss.xlsx', header=None)   # positive reviews
pos['label'] = 1
neg = pd.read_excel('negg.xlsx', header=None)   # negative reviews
neg['label'] = 0
all = pos.append(neg, ignore_index=True)        # merge the two corpora
cw = lambda s: list(jieba.cut(s))               # jieba word segmentation
all['words'] = all[0].apply(cw)
content = []
for i in all['words']:
    content.extend(i)
abc = pd.Series(content).value_counts()
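# note: min_count = 5 is defined above but never used in this script; if rare-word
# filtering was intended, it would go here, roughly: abc = abc[abc >= min_count]
# (that line is an assumption and is not part of the original code)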
abc[:] = range(1, len(abc)+1)
abc[''] = 0 
def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]
    s = s[:maxlen] + ['']*max(0, maxlen-len(s))
    return list(abc[s])
all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))
idx = list(range(len(all)))   # list() so shuffle also works under Python 3
np.random.shuffle(idx)
all = all.loc[idx]
x = np.array(list(all['doc2num']))
y = np.array(list(all['label']))
y = y.reshape((-1,1)) 
train_num = 1600   # first 1600 shuffled samples for training, remaining 400 for testing
X_train=x[:train_num]
y_train=y[:train_num]
X_test=x[train_num:]
y_test=y[train_num:]
model = Sequential()
model.add(Embedding(len(abc), embedding_vecor_length,input_length=maxlen))
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu'))
model.add(GlobalMaxPooling1D())


model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test,verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)




 

The results are as follows:
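The accuracy printed by the script is not reproduced here. As a small follow-up, once the model is trained it can be used to score a new review; a hypothetical sketch, reusing doc2num and maxlen from the script above:

new_review = '这个产品很好用'                 # invented example sentence
tokens = list(jieba.cut(new_review))          # segment it the same way as the training data
seq = np.array([doc2num(tokens, maxlen)])     # shape (1, maxlen)
prob = model.predict(seq)[0][0]               # sigmoid output: probability of being positive
print('positive' if prob > 0.5 else 'negative', prob)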

 
