[Text Classification - 03] charCNN

Contents

  1. Overview
  2. Dataset
  3. Main code

1. Overview

This text classification series will consist of roughly 8 articles. The code can be downloaded directly from GitHub and the training data from Baidu Cloud; import the project into PyCharm and it is ready to run. The series covers text classification based on word2vec pre-trained embeddings as well as classification based on recent pre-trained models (ELMo, BERT, etc.). The full series:

word2vec pre-trained word vectors

textCNN model

charCNN model

Bi-LSTM model

Bi-LSTM + Attention model

Transformer model

ELMo pre-trained model

BERT pre-trained model

charCNN model structure

The charCNN paper, Character-level Convolutional Networks for Text Classification, proposes an architecture of 6 convolutional layers + 3 fully connected layers (see the architecture figure in the paper).

For datasets of different sizes, the paper proposes two sets of structural parameters (a large and a small configuration), summarized below for:

1) the convolutional layers

2) the fully connected layers
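
The original post shows these parameters as images; the values below are reconstructed from Tables 1 and 2 of the paper (Zhang, Zhao & LeCun, 2015) rather than taken from the post itself:

    Conv layer   Large features   Small features   Kernel   Pool
    1            1024             256              7        3
    2            1024             256              7        3
    3            1024             256              3        N/A
    4            1024             256              3        N/A
    5            1024             256              3        N/A
    6            1024             256              3        3

    FC layer     Large units   Small units
    7            2048          1024
    8            2048          1024
    9            number of classes (task-dependent)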

2. Dataset

The dataset is IMDB movie reviews. There are three data files in the /data/rawData directory: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification requires labeled data (labeledTrainData), but the unlabeled data can also be used when training the word2vec embedding model (unsupervised learning).

Training data download: https://pan.baidu.com/s/1-XEwx1ai8kkGsMagIFKX_g (extraction code: rtz8)
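
Note that the config below reads ../data/preProcess/labeledCharTrain.csv rather than the raw .tsv files, so a small preprocessing step is implied but not shown. A minimal sketch of what it might look like (hypothetical, assuming labeledTrainData.tsv has "review" and "sentiment" columns):

    # hypothetical sketch of the raw-tsv -> csv preprocessing step
    import pandas as pd

    df = pd.read_csv("../data/rawData/labeledTrainData.tsv", sep="\t")
    # keep only the columns the Dataset class expects, lowercased to match the character alphabet
    df["review"] = df["review"].str.lower()
    df[["review", "sentiment"]].to_csv("../data/preProcess/labeledCharTrain.csv", index=False)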

3. Main code

3.1 Training parameter configuration: parameter_config.py

    # Author:yifan
    # 1. Parameter configuration
    class TrainingConfig(object):
        epoches = 6
        evaluateEvery = 100
        checkpointEvery = 100
        learningRate = 0.001

    class ModelConfig(object):
        # each sub-list holds: number of filters, filter height, pooling size
        convLayers = [[256, 7, 4],
                      [256, 7, 4],
                      [256, 3, 4]]
        fcLayers = [512]
        dropoutKeepProb = 0.5
        epsilon = 1e-3  # small value added in the BN layer to avoid a zero denominator
        decay = 0.999   # decay used for the moving averages in the BN layer

    class Config(object):
        # we use the 69 characters proposed in the paper to represent the input data
        alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
        # alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
        sequenceLength = 1014  # length of the character sequence
        batchSize = 128
        rate = 0.8  # proportion of data used for training
        dataSource = "../data/preProcess/labeledCharTrain.csv"
        training = TrainingConfig()
        model = ModelConfig()

    config = Config()
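
This config uses only 3 convolutional layers, a lighter variant of the paper's 6. Since the model code below skips pooling whenever the pooling size is None, the paper's small configuration could be expressed in the same format; a sketch (not part of the original repo):

    # hypothetical alternative: the paper's small configuration
    # (6 conv layers; pooling of size 3 only after layers 1, 2 and 6)
    convLayers = [[256, 7, 3],
                  [256, 7, 3],
                  [256, 3, None],
                  [256, 3, None],
                  [256, 3, None],
                  [256, 3, 3]]
    # for sequenceLength = 1014 this gives the paper's final feature length of (1014 - 96) / 27 = 34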

3.2 Generating the training data: get_train_data.py

1) Load the data and split every sentence into a character representation

2) Build the character-index mappings and save them in json format so they can be loaded at inference time

3) Convert the characters into one-hot embeddings, used as the initial value of the model's embedding layer

4) Split the dataset into a training set and a validation set

    # Author:yifan
    import json
    import pandas as pd
    import numpy as np
    import parameter_config

    # 2. Training data generation
    #   1) Load the data and split every sentence into a character representation
    #   2) Build the character-index mappings and save them as json for use at inference time
    #   3) Convert the characters into one-hot embeddings, used to initialize the embedding layer
    #   4) Split the dataset into a training set and a validation set

    # data preprocessing class that produces the training and validation sets
    class Dataset(object):
        def __init__(self, config):   # everything under config. comes from parameter_config.py
            self._dataSource = config.dataSource          # data path
            self._sequenceLength = config.sequenceLength  # length of the character sequence
            self._rate = config.rate                      # proportion of data used for training
            self._alphabet = config.alphabet
            self.trainReviews = []
            self.trainLabels = []
            self.evalReviews = []
            self.evalLabels = []
            self.charEmbedding = None
            self._charToIndex = {}
            self._indexToChar = {}

        def _readData(self, filePath):
            """
            Read the dataset from a csv file
            """
            df = pd.read_csv(filePath)
            labels = df["sentiment"].tolist()
            review = df["review"].tolist()
            reviews = [[char for char in line if char != " "] for line in review]
            return reviews, labels

        def _reviewProcess(self, review, sequenceLength, charToIndex):
            """
            Represent each review by character indices;
            in charToIndex the index of "pad" is 0
            """
            reviewVec = np.zeros((sequenceLength))
            sequenceLen = sequenceLength
            # check whether the review is shorter than the fixed sequence length
            if len(review) < sequenceLength:
                sequenceLen = len(review)
            for i in range(sequenceLen):
                if review[i] in charToIndex:
                    reviewVec[i] = charToIndex[review[i]]
                else:
                    reviewVec[i] = charToIndex["UNK"]
            return reviewVec

        def _genTrainEvalData(self, x, y, rate):
            """
            Generate the training and validation sets. Each resulting row is one sentence
            of sequenceLength = 1014 characters, each character represented by its index.
            """
            reviews = []
            labels = []
            # walk over all texts and convert every character to its index
            for i in range(len(x)):
                reviewVec = self._reviewProcess(x[i], self._sequenceLength, self._charToIndex)
                reviews.append(reviewVec)
                labels.append([y[i]])
            trainIndex = int(len(x) * rate)
            trainReviews = np.asarray(reviews[:trainIndex], dtype="int64")
            trainLabels = np.array(labels[:trainIndex], dtype="float32")
            evalReviews = np.asarray(reviews[trainIndex:], dtype="int64")
            evalLabels = np.array(labels[trainIndex:], dtype="float32")
            return trainReviews, trainLabels, evalReviews, evalLabels

        def _getCharEmbedding(self, chars):
            """
            Map each character to a one-hot vector:
            "pad" is [0, 0, 0, ...], "UNK" is [1, 0, 0, ...], "a" is [0, 1, 0, ...], and so on
            """
            alphabet = ["UNK"] + [char for char in self._alphabet]
            vocab = ["pad"] + alphabet
            charEmbedding = []
            charEmbedding.append(np.zeros(len(alphabet), dtype="float32"))

            for i, alpha in enumerate(alphabet):
                onehot = np.zeros(len(alphabet), dtype="float32")
                # one-hot vector for this character
                onehot[i] = 1
                # append it to the character embedding matrix
                charEmbedding.append(onehot)
            return vocab, np.array(charEmbedding)

        def _genVocabulary(self, reviews):
            """
            Generate the character embeddings and the character-index mappings
            """
            chars = [char for char in self._alphabet]
            vocab, charEmbedding = self._getCharEmbedding(chars)
            self.charEmbedding = charEmbedding

            self._charToIndex = dict(zip(vocab, list(range(len(vocab)))))
            self._indexToChar = dict(zip(list(range(len(vocab))), vocab))

            # save the mappings as json so they can be loaded directly at inference time
            with open("../data/charJson/charToIndex.json", "w", encoding="utf-8") as f:
                json.dump(self._charToIndex, f)
            with open("../data/charJson/indexToChar.json", "w", encoding="utf-8") as f:
                json.dump(self._indexToChar, f)

        def dataGen(self):
            """
            Initialize the training and validation sets
            """
            # load the dataset
            # reviews: [['"', 'w', 'i', 't', 'h', 'a', 'l', 'l', 't', 'h', 'i', 's', 's', 't', 'u', 'f', 'f
            # labels: [1, ...
            reviews, labels = self._readData(self._dataSource)
            # initialize the character-index mappings and the embedding matrix
            self._genVocabulary(reviews)
            # split into a training set (20000 reviews) and a validation set (5000 reviews)
            trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(reviews, labels, self._rate)
            self.trainReviews = trainReviews
            self.trainLabels = trainLabels
            self.evalReviews = evalReviews
            self.evalLabels = evalLabels

    # test
    # config = parameter_config.Config()
    # data = Dataset(config)
    # data.dataGen()

3.3 Model construction: mode_structure.py

    # Author:yifan
    import tensorflow as tf
    import math
    import parameter_config

    # 3. Model construction: define the char-CNN classifier
    class CharCNN(object):
        """
        char-CNN for text classification.
        A BN layer was added to the charCNN model, but the improvement was not obvious
        and there were even some convergence issues, to be investigated later.
        """
        def __init__(self, config, charEmbedding):
            # placeholders for input, output and dropout
            self.inputX = tf.placeholder(tf.int32, [None, config.sequenceLength], name="inputX")
            self.inputY = tf.placeholder(tf.float32, [None, 1], name="inputY")
            self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")
            self.isTraining = tf.placeholder(tf.bool, name="isTraining")
            self.epsilon = config.model.epsilon
            self.decay = config.model.decay

            # character embedding
            with tf.name_scope("embedding"):
                # use the one-hot character vectors to initialize the embedding matrix
                self.W = tf.Variable(tf.cast(charEmbedding, dtype=tf.float32, name="charEmbedding"), name="W")
                # look up the character embeddings
                self.embededChars = tf.nn.embedding_lookup(self.W, self.inputX)
                # add a channel dimension
                self.embededCharsExpand = tf.expand_dims(self.embededChars, -1)

            for i, cl in enumerate(config.model.convLayers):
                print("processing convolutional layer " + str(i + 1))
                # use name_scope to organize the variable names
                with tf.name_scope("convLayer-%s" % (i + 1)):
                    # width of the character vectors
                    filterWidth = self.embededCharsExpand.get_shape()[2].value
                    # filterShape = [height, width, in_channels, out_channels]
                    filterShape = [cl[1], filterWidth, 1, cl[0]]
                    stdv = 1 / math.sqrt(cl[0] * cl[1])

                    # initialize w and b with a uniform distribution
                    wConv = tf.Variable(tf.random_uniform(filterShape, minval=-stdv, maxval=stdv),
                                        dtype='float32', name='w')
                    bConv = tf.Variable(tf.random_uniform(shape=[cl[0]], minval=-stdv, maxval=stdv), name='b')

                    # w_conv = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="w")
                    # b_conv = tf.Variable(tf.constant(0.1, shape=[cl[0]]), name="b")
                    # build the convolutional layer, passing in the initialized kernel (wConv)
                    conv = tf.nn.conv2d(self.embededCharsExpand, wConv, strides=[1, 1, 1, 1], padding="VALID", name="conv")
                    # add the bias
                    hConv = tf.nn.bias_add(conv, bConv)
                    # apply relu directly: tf.nn.conv2d performs the convolution, the bias is
                    # added to its result, and that sum is fed into the relu
                    hConv = tf.nn.relu(hConv)

                    # with tf.name_scope("batchNormalization"):
                    #     hConvBN = self._batchNorm(hConv)

                    if cl[-1] is not None:
                        ksizeShape = [1, cl[2], 1, 1]
                        hPool = tf.nn.max_pool(hConv, ksize=ksizeShape, strides=ksizeShape, padding="VALID", name="pool")
                    else:
                        hPool = hConv

                    print(hPool.shape)

                    # transpose so the output matches the input layout of the next conv layer
                    self.embededCharsExpand = tf.transpose(hPool, [0, 1, 3, 2], name="transpose")

            print(self.embededCharsExpand)
            with tf.name_scope("reshape"):
                fcDim = self.embededCharsExpand.get_shape()[1].value * self.embededCharsExpand.get_shape()[2].value
                self.inputReshape = tf.reshape(self.embededCharsExpand, [-1, fcDim])

            weights = [fcDim] + config.model.fcLayers

            for i, fl in enumerate(config.model.fcLayers):   # fcLayers = [512]
                with tf.name_scope("fcLayer-%s" % (i + 1)):
                    print("processing fully connected layer " + str(i + 1))
                    stdv = 1 / math.sqrt(weights[i])
                    # initialize w and b of the fully connected layer with a uniform distribution
                    wFc = tf.Variable(tf.random_uniform([weights[i], fl], minval=-stdv, maxval=stdv), dtype="float32",
                                      name="w")
                    bFc = tf.Variable(tf.random_uniform(shape=[fl], minval=-stdv, maxval=stdv), dtype="float32", name="b")

                    # w_fc = tf.Variable(tf.truncated_normal([weights[i], fl], stddev=0.05), name="W")
                    # b_fc = tf.Variable(tf.constant(0.1, shape=[fl]), name="b")

                    self.fcInput = tf.nn.relu(tf.matmul(self.inputReshape, wFc) + bFc)
                    with tf.name_scope("dropOut"):
                        self.fcInputDrop = tf.nn.dropout(self.fcInput, self.dropoutKeepProb)
                self.inputReshape = self.fcInputDrop

            with tf.name_scope("outputLayer"):
                stdv = 1 / math.sqrt(weights[-1])
                # initialization for the hidden-to-output weights and bias
                # w_out = tf.Variable(tf.truncated_normal([fc_layers[-1], num_classes], stddev=0.1), name="W")
                # b_out = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")

                wOut = tf.Variable(tf.random_uniform([config.model.fcLayers[-1], 1], minval=-stdv, maxval=stdv),
                                   dtype="float32", name="w")
                bOut = tf.Variable(tf.random_uniform(shape=[1], minval=-stdv, maxval=stdv), name="b")
                # tf.nn.xw_plus_b computes x*w + b
                self.predictions = tf.nn.xw_plus_b(self.inputReshape, wOut, bOut, name="predictions")
                # binary classification by thresholding the logit at 0
                self.binaryPreds = tf.cast(tf.greater_equal(self.predictions, 0.0), tf.float32, name="binaryPreds")

            with tf.name_scope("loss"):
                # loss: apply a sigmoid to the logits and take the cross entropy
                losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.predictions, labels=self.inputY)
                self.loss = tf.reduce_mean(losses)

        def _batchNorm(self, x):
            # BN layer implementation
            gamma = tf.Variable(tf.ones([x.get_shape()[3].value]))
            beta = tf.Variable(tf.zeros([x.get_shape()[3].value]))
            self.popMean = tf.Variable(tf.zeros([x.get_shape()[3].value]), trainable=False, name="popMean")
            self.popVariance = tf.Variable(tf.ones([x.get_shape()[3].value]), trainable=False, name="popVariance")

            def batchNormTraining():
                # use the correct axes so the mean and variance are computed per feature map
                # rather than over the whole tensor
                batchMean, batchVariance = tf.nn.moments(x, [0, 1, 2], keep_dims=False)
                trainMean = tf.assign(self.popMean, self.popMean * self.decay + batchMean * (1 - self.decay))
                trainVariance = tf.assign(self.popVariance,
                                          self.popVariance * self.decay + batchVariance * (1 - self.decay))

                with tf.control_dependencies([trainMean, trainVariance]):
                    return tf.nn.batch_normalization(x, batchMean, batchVariance, beta, gamma, self.epsilon)

            def batchNormInference():
                return tf.nn.batch_normalization(x, self.popMean, self.popVariance, beta, gamma, self.epsilon)

            batchNormalizedOutput = tf.cond(self.isTraining, batchNormTraining, batchNormInference)
            return tf.nn.relu(batchNormalizedOutput)
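
To see where the reshape dimension fcDim comes from: with VALID padding each convolution shortens the sequence by kernel - 1, and each max pool divides the length by the pool size (rounding down). A minimal standalone sketch of that arithmetic for the convLayers above:

    # sketch: trace the sequence length through the conv/pool stack
    def conv_pool_length(length, conv_layers):
        for _, kernel, pool in conv_layers:
            length = length - kernel + 1              # VALID convolution
            if pool is not None:
                length = (length - pool) // pool + 1  # VALID max pooling, stride == pool
        return length

    convLayers = [[256, 7, 4], [256, 7, 4], [256, 3, 4]]
    length = conv_pool_length(1014, convLayers)  # 1014 -> 252 -> 61 -> 14
    print(length, length * 256)                  # fcDim = 14 * 256 = 3584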

3.4 Model training: mode_trainning.py

    # Author:yifan
    import os
    import datetime
    import warnings
    import numpy as np
    import tensorflow as tf
    from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
    warnings.filterwarnings("ignore")
    import parameter_config
    import get_train_data
    import mode_structure

    # get the data produced by the previous modules
    config = parameter_config.Config()
    data = get_train_data.Dataset(config)
    data.dataGen()

    # 4. Generate the batched dataset
    def nextBatch(x, y, batchSize):
        # yield shuffled batches with a generator
        perm = np.arange(len(x))
        np.random.shuffle(perm)
        x = x[perm]
        y = y[perm]
        numBatches = len(x) // batchSize

        for i in range(numBatches):
            start = i * batchSize
            end = start + batchSize
            batchX = np.array(x[start: end], dtype="int64")
            batchY = np.array(y[start: end], dtype="float32")
            yield batchX, batchY

    # 5. Define the metric functions
    """
    performance metrics
    """
    def mean(item):
        return sum(item) / len(item)

    def genMetrics(trueY, predY, binaryPredY):
        """
        compute acc, auc, precision and recall
        """
        auc = roc_auc_score(trueY, predY)
        accuracy = accuracy_score(trueY, binaryPredY)
        precision = precision_score(trueY, binaryPredY, average='macro')
        recall = recall_score(trueY, binaryPredY, average='macro')
        return round(accuracy, 4), round(auc, 4), round(precision, 4), round(recall, 4)

    # 6. Train the model
    # training and validation sets
    trainReviews = data.trainReviews
    trainLabels = data.trainLabels
    evalReviews = data.evalReviews
    evalLabels = data.evalLabels
    charEmbedding = data.charEmbedding

    # define the computation graph
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
        session_conf.gpu_options.allow_growth = True
        session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # gpu memory fraction
        sess = tf.Session(config=session_conf)

        # define the session
        with sess.as_default():
            cnn = mode_structure.CharCNN(config, charEmbedding)
            globalStep = tf.Variable(0, name="globalStep", trainable=False)
            # define the optimizer, passing in the learning rate
            optimizer = tf.train.RMSPropOptimizer(config.training.learningRate)
            # compute the gradients, returning (gradient, variable) pairs
            gradsAndVars = optimizer.compute_gradients(cnn.loss)
            # apply the gradients to the variables to build the training op
            trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

            # write summaries for tensorBoard
            gradSummaries = []
            for g, v in gradsAndVars:
                if g is not None:
                    tf.summary.histogram("{}/grad/hist".format(v.name), g)
                    tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
            outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
            print("Writing to {}\n".format(outDir))
            lossSummary = tf.summary.scalar("trainLoss", cnn.loss)

            summaryOp = tf.summary.merge_all()

            trainSummaryDir = os.path.join(outDir, "train")
            trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)
            evalSummaryDir = os.path.join(outDir, "eval")
            evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

            # saver for checkpoint files, keeping at most 5
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

            # one way to save the model: export it as a pb file (SavedModel)
            builder = tf.saved_model.builder.SavedModelBuilder("../model/charCNN/savedModel")
            sess.run(tf.global_variables_initializer())

            def trainStep(batchX, batchY):
                """
                training step
                """
                feed_dict = {
                    cnn.inputX: batchX,
                    cnn.inputY: batchY,
                    cnn.dropoutKeepProb: config.model.dropoutKeepProb,
                    cnn.isTraining: True
                }
                _, summary, step, loss, predictions, binaryPreds = sess.run(
                    [trainOp, summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                    feed_dict)
                timeStr = datetime.datetime.now().isoformat()
                acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
                print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                    timeStr, step, loss, acc, auc, precision, recall))
                trainSummaryWriter.add_summary(summary, step)

            def devStep(batchX, batchY):
                """
                validation step
                """
                feed_dict = {
                    cnn.inputX: batchX,
                    cnn.inputY: batchY,
                    cnn.dropoutKeepProb: 1.0,
                    cnn.isTraining: False
                }
                summary, step, loss, predictions, binaryPreds = sess.run(
                    [summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                    feed_dict)

                acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)

                evalSummaryWriter.add_summary(summary, step)

                return loss, acc, auc, precision, recall

            for i in range(config.training.epoches):
                # train the model
                print("start training model")
                for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                    trainStep(batchTrain[0], batchTrain[1])

                    currentStep = tf.train.global_step(sess, globalStep)
                    if currentStep % config.training.evaluateEvery == 0:
                        print("\nEvaluation:")

                        losses = []
                        accs = []
                        aucs = []
                        precisions = []
                        recalls = []

                        for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                            loss, acc, auc, precision, recall = devStep(batchEval[0], batchEval[1])
                            losses.append(loss)
                            accs.append(acc)
                            aucs.append(auc)
                            precisions.append(precision)
                            recalls.append(recall)

                        time_str = datetime.datetime.now().isoformat()
                        print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                            time_str, currentStep, mean(losses), mean(accs), mean(aucs),
                            mean(precisions), mean(recalls)))

                    if currentStep % config.training.checkpointEvery == 0:
                        # the other way to save the model: checkpoint files
                        path = saver.save(sess, "../model/charCNN/model/my-model", global_step=currentStep)
                        print("Saved model checkpoint to {}\n".format(path))

            inputs = {"inputX": tf.saved_model.utils.build_tensor_info(cnn.inputX),
                      "keepProb": tf.saved_model.utils.build_tensor_info(cnn.dropoutKeepProb)}

            outputs = {"binaryPreds": tf.saved_model.utils.build_tensor_info(cnn.binaryPreds)}

            prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(
                inputs=inputs, outputs=outputs,
                method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
            legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
            builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
                                                 signature_def_map={"predict": prediction_signature},
                                                 legacy_init_op=legacy_init_op)

            builder.save()
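
For reference, a rough sketch of the training schedule these settings imply (assuming the full 25000-review labeledTrainData and rate = 0.8):

    # sketch: training-schedule arithmetic for the config above
    trainSize = int(25000 * 0.8)       # 20000 reviews go to the training set
    stepsPerEpoch = trainSize // 128   # 156 batches per epoch (batchSize = 128)
    totalSteps = stepsPerEpoch * 6     # 936 steps over 6 epochs
    # with evaluateEvery = checkpointEvery = 100, that is roughly 9 evaluations and checkpoints
    print(stepsPerEpoch, totalSteps)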

3.5 Prediction: predict.py

    # Author:yifan
    import tensorflow as tf
    import parameter_config
    import get_train_data

    config = parameter_config.Config()
    data = get_train_data.Dataset(config)

    # 7. Prediction
    x = "this movie is full of references like mad max ii the wild one and many others the ladybug´s face it´s a clear reference or tribute to peter lorre this movie is a masterpiece we´ll talk much more about in the future"
    # x = "This film is not good"   # final output: 1
    # x = "This film is   bad"      # final output: 0
    # x = "This film is   good"     # final output: 1

    # turn the text into a vector the model can consume, using the helpers from get_train_data
    y = list(x)
    data._genVocabulary(y)  # rebuilds the char-index mapping (the argument is unused internally)
    print(x)
    reviewVec = data._reviewProcess(y, config.sequenceLength, data._charToIndex)
    print(reviewVec)

    graph = tf.Graph()
    with graph.as_default():
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
        session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, gpu_options=gpu_options)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            # restore the model from the latest checkpoint
            checkpoint_file = tf.train.latest_checkpoint("../model/charCNN/model/")
            saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
            saver.restore(sess, checkpoint_file)

            # fetch the input tensors that the output depends on
            inputX          = graph.get_operation_by_name("inputX").outputs[0]
            dropoutKeepProb = graph.get_operation_by_name("dropoutKeepProb").outputs[0]

            # fetch the output tensor
            predictions = graph.get_tensor_by_name("outputLayer/binaryPreds:0")
            pred = sess.run(predictions, feed_dict={inputX: [reviewVec], dropoutKeepProb: 1.0})[0]

    # pred = [idx2label[item] for item in pred]
    print(pred)
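
mode_trainning.py also exports a SavedModel (pb file), so the checkpoint restore above could instead load that export. A sketch, assuming the export directory used during training and reusing reviewVec from the script above:

    # sketch: load the exported SavedModel instead of a checkpoint
    with tf.Session(graph=tf.Graph()) as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING],
                                   "../model/charCNN/savedModel")
        inputX = sess.graph.get_tensor_by_name("inputX:0")
        dropoutKeepProb = sess.graph.get_tensor_by_name("dropoutKeepProb:0")
        binaryPreds = sess.graph.get_tensor_by_name("outputLayer/binaryPreds:0")
        print(sess.run(binaryPreds, feed_dict={inputX: [reviewVec], dropoutKeepProb: 1.0})[0])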

Results

The full code is available at: https://github.com/yifanhunter/NLP_textClassifier

Main reference:

[1] https://home.cnblogs.com/u/jiangxinyang/
