[TensorFlow] Digit Recognition with a Few-Layer NN

An introductory hands-on project.

Project 1 - Digit Recognition

Getting Started

Copy the archive src.zip into your own filespace and unzip it. Then type

cd src

You will see two files: train.py  and  hw1.py

Now run  train.py  by typing

python3 train.py

When run for the first time,  train.py  should create a new folder called  data  and download a copy of the MNIST dataset into this folder. All subsequent runs of  train.py  will use this local data. (Don't worry about the  ValueError  at this stage.)

The file  train.py  contains the TensorFlow code required to create a session, build the graph, and run training and test iterations. It has been provided to assist you with the testing and evaluation of your model. While it is not required for this assignment to have a detailed understanding of this code, it will be useful when implementing your own models, and for later assignments.

The file  train.py  calls functions defined in hw1.py  and should not be modified during the course of the assignment. A submission that does not run correctly when  train.py  is called will lose marks. The only situation where you should modify  train.py  is when you need to switch between different network architectures. This can be done by setting the global variable on line 7:

network = "none"

to any of the following values:

network = "onelayer"
network = "twolayer"
network = "conv"

The file  hw1.py  contains function definitions for the three networks to be created. You may also define helper functions in this file if necessary, as long as the original function names and arguments are not modified. Changing the function name, argument list, or return value will cause all tests to fail for that function. Your marks will be automatically generated by a test script, which will evaluate the correctness of the implemented networks. For this reason, it is important that you stick to the specification exactly. Networks that do not meet the specification but otherwise function correctly will be marked as incorrect.

 

Stage 0: Provided Code

The functions  input_placeholder()  and target_placeholder()  specify the inputs and outputs of your networks in the TensorFlow graph. They have been implemented for you.

In addition, there is a function  train_step()  that passes batches of images to the constructed TensorFlow Graph during training. Its implementation should help you understand the shape and structure of the actual data that is being provided to the model.

Unless otherwise specified, the underlying type (dtype) for each TF object should be  float32.  INPUT_SIZE, where it appears in comments, refers to the length of a flattened single image; in this case 784.  OUTPUT_SIZE, where it appears in comments, refers to the length of a one-hot output vector; in this case 10.

 

Stage 1: Single-Layer Network (3 marks)

Write a function  onelayer(X, Y, layersize=10)  which creates a TensorFlow model for a one-layer neural network (sometimes also called logistic regression). Your model should consist of one fully connected layer with weights  w  and biases  b, using softmax activation.

Your function should take two parameters  X  and  Y that are TensorFlow placeholders as defined in  input_placeholder()  and target_placeholder(). It should return variables  w, b, logits, preds, batch_xentropy  and  batch_loss (one possible implementation is sketched below the list), where:

  •  w  and  b  are TensorFlow variables representing the weights and biases, respectively
  •  logits  and  preds  are the input to the activation function and its output
  •  batch_xentropy  is the cross-entropy loss for each image in the batch
  •  batch_loss  is the average of the cross-entropy loss for all images in the batch
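
For reference, a minimal sketch of one possible implementation (not the only correct one; it assumes  import tensorflow as tf  as in hw1.py, and derives the input size from X so the same function can be reused by the conv net in the later stage):

def onelayer(X, Y, layersize=10):
    insize = int(X.get_shape()[1])  # 784 for flattened MNIST images
    w = tf.Variable(tf.zeros([insize, layersize]), name="w")
    b = tf.Variable(tf.zeros([layersize]), name="b")
    logits = tf.matmul(X, w) + b   # input to the activation function
    preds = tf.nn.softmax(logits)  # probability distribution over the 10 digits
    batch_xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    batch_loss = tf.reduce_mean(batch_xentropy)
    return w, b, logits, preds, batch_xentropy, batch_loss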

Change line 7 of  train.py  to network = "onelayer" and test your network on the MNIST dataset by typing

python3 train.py

It should achieve about 92% accuracy after 5 epochs of training.

It is a good idea to submit your code after completing Stage 1, because the submit script will run some simple tests and give you some feedback on whether your model is correctly structured.

 

Stage 2: Two-Layer Network (3 marks)

Create a TensorFlow model for a Neural Network with two fully connected layers of weights  w1, w2  and biases  b1, b2, with ReLU activation on the first layer and softmax on the second. Your function should take two parameters  X  and  Y that are TensorFlow placeholders as defined in  input_placeholder()  and target_placeholder(). It should return variables  w1, b1, w2, b2, logits, preds, batch_xentropy  and  batch_loss (one possible implementation is sketched below the list), where:

  •  w1  and  b1  are TensorFlow variables representing the weights and biases of the first layer
  •  w2  and  b2  are TensorFlow variables representing the weights and biases of the second layer
  •  logits  and  preds  are the inputs to the final activation functions and their output
  •  batch_xentropy  is the cross-entropy loss for each image in the batch
  •  batch_loss  is the average of the cross-entropy loss for all images in the batch
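
Again for reference, a minimal sketch of one possible implementation (the initialisation scheme is a free choice; small truncated-normal weights are assumed here):

def twolayer(X, Y, hiddensize=30, outputsize=10):
    insize = int(X.get_shape()[1])
    w1 = tf.Variable(tf.truncated_normal([insize, hiddensize], stddev=0.1), name="w1")
    b1 = tf.Variable(tf.zeros([hiddensize]), name="b1")
    w2 = tf.Variable(tf.truncated_normal([hiddensize, outputsize], stddev=0.1), name="w2")
    b2 = tf.Variable(tf.zeros([outputsize]), name="b2")
    hidden = tf.nn.relu(tf.matmul(X, w1) + b1)  # first layer: ReLU
    logits = tf.matmul(hidden, w2) + b2         # second layer: softmax applied below
    preds = tf.nn.softmax(logits)
    batch_xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    batch_loss = tf.reduce_mean(batch_xentropy)
    return w1, b1, w2, b2, logits, preds, batch_xentropy, batch_loss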

Change line 7 of  train.py  to network = "twolayer" and test your network on the MNIST dataset by typing

python3 train.py

Note: if you are using the CSE Lab machines and are running out of memory, you may like to remove the files in the  summaries  directory from previous training runs.

 

Stage 3: Convolutional Network (4 marks)

Create a TensorFlow model for a Convolutional Neural Network. This network should consist of two convolutional layers followed by a fully connected layer of the form:

conv_layer1 → conv_layer2 → fully-connected → output

Your function should take two parameters  X  and  Y that are TensorFlow placeholders as defined in  input_placeholder()  and target_placeholder(). It should return variables  conv1, conv2, w, b, logits, preds, batch_xentropy  and  batch_loss (a sketch follows the hints below), where:

  •  conv1  is a convolutional layer of  convlayer_sizes[0]  filters of shape  filter_shape
  •  conv2  is a convolutional layer of  convlayer_sizes[1]  filters of shape  filter_shape
  •  w  and  b  are TensorFlow variables representing the weights and biases of the final fully connected layer
  •  logits  and  preds  are the inputs to the final activation functions and their output
  •  batch_xentropy  is the cross-entropy loss for each image in the batch
  •  batch_loss  is the average of the cross-entropy loss for all images in the batch

Hints:

  1. use tf.layers.conv2d
  2. the final layer is very similar to the  onelayer  network, except that the input will be from the  conv2  layer. If you reshape the  conv2  output using  tf.reshape, you should be able to call  onelayer()  to get the final layer of your network
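
Putting the two hints together, a sketch of one possible implementation (it assumes the onelayer sketch above, which derives its input size from X; note that train.py already reshapes X to [-1, 28, 28, 1] before calling convnet):

def convnet(X, Y, convlayer_sizes=[10, 10],
            filter_shape=[3, 3], outputsize=10, padding="same"):
    conv1 = tf.layers.conv2d(X, filters=convlayer_sizes[0], kernel_size=filter_shape,
                             padding=padding, activation=tf.nn.relu, name="conv1")
    conv2 = tf.layers.conv2d(conv1, filters=convlayer_sizes[1], kernel_size=filter_shape,
                             padding=padding, activation=tf.nn.relu, name="conv2")
    # "same" padding with stride 1 keeps the 28x28 spatial shape
    flat = tf.reshape(conv2, [-1, 28 * 28 * convlayer_sizes[1]])
    w, b, logits, preds, batch_xentropy, batch_loss = onelayer(flat, Y, outputsize)
    return conv1, conv2, w, b, logits, preds, batch_xentropy, batch_loss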

Change line 7 of  train.py  to network = "conv" and test your network on the MNIST dataset by typing

python3 train.py

It may take several minutes to run, depending on your processor.

Notes

All TensorFlow objects, if not otherwise specified, should be explicitly created with  tf.float32  datatypes. Not specifying this datatype for variables and placeholders will cause your code to fail some tests.

TensorFlow provides multiple APIs at various levels of abstraction. For the functionality specified in this assignment, there are generally high-level TensorFlow library calls that can be used. As the assignment assesses TensorFlow, functionality that is technically correct but implemented manually, using a library such as  numpy, will fail tests. If you find yourself writing 50+ line methods, it may be a good idea to look for a simpler solution.
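
For example, the per-image cross-entropy is a single library call on the logits rather than a hand-rolled reduction:

batch_xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)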

   

Visualizing Your Models

In addition to the output of  train.py,  you can view the progress of your models and the created TensorFlow graph using the TensorFlow visualization platform, TensorBoard. After beginning training, run the following command from the src directory:

python3 -m tensorflow.tensorboard --logdir=./summaries

Depending on your installation, the following command might also work:

tensorboard --logdir=./summaries
  1. open a Web browser and navigate to  http://localhost:6006
  2. you should be able to see a plot of the train and test accuracies in TensorBoard
  3. if you click on the histogram tab you'll also see some histograms of your weights, biases and the pre-activation inputs to the softmax in the final layer

 


 

Code analysis — train.py:

import tensorflow as tf
import hw1 as qfns
import time
from datetime import datetime
from tensorflow.examples.tutorials.mnist import input_data

#options: "onelayer", "twolayer", "conv"
network  = "none"
#network = "onelayer"
#network = "twolayer"
#network = "conv"


def accuracy(sess, dataset, batch_size, X, Y, accuracy_op):
    # compute number of batches for given batch_size
    num_test_batches = dataset.num_examples // batch_size

    overall_accuracy = 0.0
    for i in range(num_test_batches):
        batch = dataset.next_batch(batch_size)  # draw from the dataset passed in (train or test)
        accuracy_batch = \
            sess.run(accuracy_op, feed_dict={X: batch[0], Y: batch[1]})
        overall_accuracy += accuracy_batch

    return overall_accuracy/num_test_batches

def variable_summaries(var, name):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope(name + '_summaries'):  # the enclosing scope
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)


def train(sess, mnist, n_training_epochs, batch_size,
          summaries_op, accuracy_summary_op, train_writer, test_writer,
          X, Y, train_op, loss_op, accuracy_op):
    # compute number of batches for given batch_size
    num_train_batches = mnist.train.num_examples // batch_size

    # record starting time
    train_start = time.time()

    # Run through the entire dataset n_training_epochs times
    for i in range(n_training_epochs):  # <-- outer loop over epochs
        # Initialise statistics
        training_loss = 0
        epoch_start = time.time()

        # Run the SGD train op for each minibatch
        for _ in range(num_train_batches):  # <-- inner loop over minibatches
            batch = mnist.train.next_batch(batch_size)
            trainstep_result, batch_loss, summary = \
                qfns.train_step(sess, batch, X, Y, train_op, loss_op, summaries_op)
            train_writer.add_summary(summary, i)  # record the training summaries after every step
            training_loss += batch_loss

        # Timing and statistics
        epoch_duration = round(time.time() - epoch_start, 2)
        ave_train_loss = training_loss / num_train_batches

        # Get accuracy
        train_accuracy = \
            accuracy(sess, mnist.train, batch_size, X, Y, accuracy_op)
        test_accuracy = \
            accuracy(sess, mnist.test, batch_size, X, Y, accuracy_op)

        # log accuracy at the current epoch on training and test sets
        train_acc_summary = sess.run(accuracy_summary_op,
                                     feed_dict={accuracy_placeholder: train_accuracy})
        train_writer.add_summary(train_acc_summary, i)
        test_acc_summary = sess.run(accuracy_summary_op,
                                    feed_dict={accuracy_placeholder: test_accuracy})
        test_writer.add_summary(test_acc_summary, i)

        # a cute trick for treating both writers uniformly (though it only saves a few keystrokes)
        [writer.flush() for writer in [train_writer, test_writer]]

        train_duration = round(time.time() - train_start, 2)
        # Output to monitor training
        print('Epoch {0}, Training Loss: {1}, Test accuracy: {2}, '
              'time: {3}s, total time: {4}s'.format(i, ave_train_loss, test_accuracy,
                                                    epoch_duration, train_duration))

    print('Total training time: {0}s'.format(train_duration))
    print('Confusion Matrix:')
    true_class = tf.argmax(Y, 1)
    predicted_class = tf.argmax(preds_op, 1)
    cm = tf.confusion_matrix(predicted_class, true_class)
    # NOTE: this feeds the entire 10,000-image test set in one run --
    # the source of the OOM error discussed at the end of this post
    print(sess.run(cm, feed_dict={X: mnist.test.images,
                                  Y: mnist.test.labels}))

# define a named scope for TensorBoard
def get_accuracy_op(preds_op, Y):
    with tf.name_scope('accuracy_ops'):
        # argmax along axis 1 (the class dimension) gives the predicted / true class index
        correct_preds_op = tf.equal(tf.argmax(preds_op, 1), tf.argmax(Y, 1))
        # this yields a batch of True/False values, which are aggregated below:
        # the tf.cast sets True to 1.0, and False to 0.0. With N predictions,
        # of which M are correct, the mean will be M/N, i.e. the accuracy
        accuracy_op = tf.reduce_mean(tf.cast(correct_preds_op, tf.float32))  # type cast
    return accuracy_op


if __name__ == "__main__":  # guard so this runs only when executed directly, not when imported as a module
    # hyperparameters
    learning_rate = 0.001
    batch_size = 128
    n_training_epochs = 20

    # load data
    mnist = input_data.read_data_sets('data/mnist', one_hot=True)

    # Input (X) and Target (Y) placeholders, they will be fed with a batch of
    # input and target values respectively, from the training and test sets
    X = qfns.input_placeholder()  # wrapped in hw1.py
    Y = qfns.target_placeholder()
    ########################################################################################
    # Graph construction
    ########################################################################################
    # Create the tensorflow computational graph for our model
    if network == "onelayer":
        w, b, logits_op, preds_op, xentropy_op, loss_op = qfns.onelayer(X, Y)
        [variable_summaries(v, name) for (v, name) in zip((w, b), ("w", "b"))]
        tf.summary.histogram('pre_activations', logits_op)
    elif network == "twolayer":
        w1, b1, w2, b2, logits_op, preds_op, xentropy_op, loss_op = \
            qfns.twolayer(X, Y)
        [variable_summaries(v, name) for (v, name) in
            zip((w1, b1, w2, b2), ("w1", "b1", "w2", "b2"))]
        tf.summary.histogram('pre_activations', logits_op)
    elif network == "conv":
        # standard conv layers
        conv1out, conv2out, w, b, logits_op, preds_op, xentropy_op, loss_op = \
            qfns.convnet(tf.reshape(X, [-1, 28, 28, 1]), Y)
        [variable_summaries(v, name) for (v, name) in ((w, "w"), (b, "b"))]
        tf.summary.histogram('pre_activations', logits_op)
    #elif network == "rollyourown":
       ## You can define your own conv net here and play around with it
    else:
        raise ValueError("Incorrect network string in line 7")

  
    # The training op performs a step of stochastic gradient descent on a minibatch
    # optimizer = tf.train.GradientDescentOptimizer  # vanilla SGD
    # optimizer = tf.train.MomentumOptimizer         # SGD with momentum
    optimizer = tf.train.AdamOptimizer  # ADAM - widely used optimiser (ref: http://arxiv.org/abs/1412.6980)
    train_op = optimizer(learning_rate).minimize(loss_op)
    ########################################################################################
    # Visualisation
    ########################################################################################
    # Prediction and accuracy ops
    accuracy_op = get_accuracy_op(preds_op, Y)  # <---- for tensorboard

    # TensorBoard for visualisation 
    # Merge all the summaries and write them out to /tmp/mnist_logs (by default)
    summaries_op = tf.summary.merge_all()       # <---- for tensorboard

    # Separate accuracy summary so we can use train and test sets
    accuracy_placeholder = tf.placeholder(shape=[], dtype=tf.float32)
    accuracy_summary_op  = tf.summary.scalar("accuracy", accuracy_placeholder)  # -->

    # When run, the init_op initialises any tensorflow variables
    # hint: weights and biases in our case
    init_op = tf.global_variables_initializer()

    # Get started
    sess = tf.Session()
    sess.run(init_op)

    # Initialise TensorBoard Summary writers
    dtstr = "{:%b_%d_%H-%M-%S}".format(datetime.now())
    train_writer = tf.summary.FileWriter('./summaries/'+dtstr+'/train', sess.graph)
    test_writer  = tf.summary.FileWriter('./summaries/'+dtstr+'/test')

    ########################################################################################
    # Training
    ########################################################################################
    # Train
    print('Starting Training...')
    train(sess, mnist, n_training_epochs, batch_size,
          summaries_op, accuracy_summary_op, train_writer, test_writer,
          X, Y, train_op, loss_op, accuracy_op)
    print('Training Complete')

    # Clean up
    sess.close()

 

In fact, this is a nice programming skeleton worth borrowing:

"""
All tensorflow objects, if not otherwise specified, should be explicitly
created with tf.float32 datatypes. Not specifying this datatype for variables and
placeholders will cause your code to fail some tests.

For the specified functionality in this assignment, there are generally high
level Tensorflow library calls that can be used. As we are assessing tensorflow,
functionality that is technically correct but implemented manually, using a
library such as numpy, will fail tests. If you find yourself writing 50+ line
methods, it may be a good idea to look for a simpler solution.

Along with the provided functional prototypes, there is another file,
"train.py" which calls the functions listed in this file. It trains the
specified network on the MNIST dataset, and then optimizes the loss using a
standard gradient descent optimizer. You can run this code to check the models
you create.

"""

import tensorflow as tf

def input_placeholder():
    """
    This placeholder serves as the input to the model, and will be populated
    with the raw images, flattened into single row vectors of length 784.

    The number of images to be stored in the placeholder for each minibatch,
    i.e. the minibatch size, may vary during training and testing, so your
    placeholder must allow for a varying number of rows.

    :return: A tensorflow placeholder of type float32 and correct shape
    """
    return tf.placeholder(dtype=tf.float32, shape=[None, 784],
                          name="image_input")

def target_placeholder():
    """
    This placeholder serves as the output for the model, and will be
    populated with targets for training, and testing. Each output will
    be a single one-hot row vector, of length equal to the number of
    classes to be classified (hint: there's one class for each digit)

    The number of target rows to be stored in the placeholder for each
    minibatch, i.e. the minibatch size, may vary during training and
    testing, so your placeholder must allow for a varying number of
    rows.

    :return: A tensorflow placeholder of type float32 and correct shape
    """
    return tf.placeholder(dtype=tf.float32, shape=[None, 10],
                          name="image_target_onehot")

def onelayer(X, Y, layersize=10):
    """
    Create a Tensorflow model for logistic regression (i.e. single layer NN)

    :param X: The  input placeholder for images from the MNIST dataset
    :param Y: The output placeholder for image labels
    :return: The following variables should be returned  (variables in the
    python sense, not in the Tensorflow sense, although some may be
    Tensorflow variables). They must be returned in the following order.
        w: Connection weights
        b: Biases
        logits: The input to the activation function
        preds: The output of the activation function (a probability
        distribution over the 10 digits)
        batch_xentropy: The cross-entropy loss for each image in the batch
        batch_loss: The average cross-entropy loss of the batch
    """
    return w, b, logits, preds, batch_xentropy, batch_loss

def twolayer(X, Y, hiddensize=30, outputsize=10):
    """
    Create a Tensorflow model for a Neural Network with one hidden layer

    :param X: The  input placeholder for images from the MNIST dataset
    :param Y: The output placeholder for image labels
    :return: The following variables should be returned in the following order.
        W1: Connection weights for the first layer
        b1: Biases for the first layer
        W2: Connection weights for the second layer
        b2: Biases for the second layer
        logits: The inputs to the activation function
        preds: The outputs of the activation function (a probability
        distribution over the 10 digits)
        batch_xentropy: The cross-entropy loss for each image in the batch
        batch_loss: The average cross-entropy loss of the batch
    """
    return w1, b1, w2, b2, logits, preds, batch_xentropy, batch_loss

def convnet(X, Y, convlayer_sizes=[10, 10], \
        filter_shape=[3, 3], outputsize=10, padding="same"):
    """
    Create a Tensorflow model for a Convolutional Neural Network. The network
    should be of the following structure:
    conv_layer1 -> conv_layer2 -> fully-connected -> output

    :param X: The  input placeholder for images from the MNIST dataset
    :param Y: The output placeholder for image labels
    :return: The following variables should be returned in the following order.
        conv1: A convolutional layer of convlayer_sizes[0] filters of shape filter_shape
        conv2: A convolutional layer of convlayer_sizes[1] filters of shape filter_shape
        w: Connection weights for final layer
        b: biases for final layer
        logits: The inputs to the activation function
        preds: The outputs of the activation function (a probability
        distribution over the 10 digits)
        batch_xentropy: The cross-entropy loss for each image in the batch
        batch_loss: The average cross-entropy loss of the batch

    hints:
    1) consider tf.layers.conv2d
    2) the final layer is very similar to the onelayer network. Only the input
    will be from the conv2 layer. If you reshape the conv2 output using tf.reshape,
    you should be able to call onelayer() to get the final layer of your network
    """
    return conv1, conv2, w, b, logits, preds, batch_xentropy, batch_loss

def train_step(sess, batch, X, Y, train_op, loss_op, summaries_op):
    """
    Run one step of training.

    :param sess: the current session
    :param batch: holds the inputs and target outputs for the current minibatch
    batch[0] - array of shape [minibatch_size, 784] with each row holding the
    input images
    batch[1] - array of shape [minibatch_size, 10] with each row holding the
    one-hot encoded targets
    :param X: the input placeholder
    :param Y: the output target placeholder
    :param train_op: the tensorflow operation that will run one step of training
    :param loss_op: the tensorflow operation that will return the loss of your
    model on the batch input/output

    :return: a 3-tuple: train_op_result, loss, summary
    which are the results of running the train_op, loss_op and summaries_op
    respectively.
    """
    train_result, loss, summary = \
        sess.run([train_op, loss_op, summaries_op], feed_dict={X: batch[0], Y: batch[1]})
    return train_result, loss, summary

 

A weird GPU out-of-memory error. The cause is probably not batching — but I thought I was already using batches.

wagner % python train.py 
Extracting data/mnist/train-images-idx3-ubyte.gz
Extracting data/mnist/train-labels-idx1-ubyte.gz
Extracting data/mnist/t10k-images-idx3-ubyte.gz
Extracting data/mnist/t10k-labels-idx1-ubyte.gz
Starting Training...
Epoch 0, Training Loss: 1.62412057612, Test accuracy: 0.928485576923, time: 17.7s, total time: 23.23s
Epoch 1, Training Loss: 1.53339798439, Test accuracy: 0.945412660256, time: 18.77s, total time: 47.16s
Epoch 2, Training Loss: 1.51623744648, Test accuracy: 0.957732371795, time: 16.98s, total time: 69.55s
Epoch 3, Training Loss: 1.50405928166, Test accuracy: 0.965544871795, time: 18.53s, total time: 93.31s
Epoch 4, Training Loss: 1.49580029535, Test accuracy: 0.969651442308, time: 18.25s, total time: 117.31s
Epoch 5, Training Loss: 1.49015715628, Test accuracy: 0.973858173077, time: 17.16s, total time: 139.7s
Epoch 6, Training Loss: 1.48674822159, Test accuracy: 0.977463942308, time: 17.14s, total time: 162.14s
Epoch 7, Training Loss: 1.48357855801, Test accuracy: 0.977063301282, time: 22.51s, total time: 192.37s
Epoch 8, Training Loss: 1.48128153867, Test accuracy: 0.976362179487, time: 19.72s, total time: 217.31s
Epoch 9, Training Loss: 1.4793002661, Test accuracy: 0.979166666667, time: 18.55s, total time: 241.85s
Epoch 10, Training Loss: 1.47781905789, Test accuracy: 0.981470352564, time: 17.13s, total time: 264.37s
Epoch 11, Training Loss: 1.47661569985, Test accuracy: 0.978265224359, time: 17.79s, total time: 287.87s
Epoch 12, Training Loss: 1.47536480844, Test accuracy: 0.978966346154, time: 17.81s, total time: 310.88s
Epoch 13, Training Loss: 1.47433425524, Test accuracy: 0.979366987179, time: 17.64s, total time: 334.69s
Epoch 14, Training Loss: 1.47369961472, Test accuracy: 0.981069711538, time: 21.16s, total time: 361.12s
Epoch 15, Training Loss: 1.47281188009, Test accuracy: 0.980168269231, time: 17.61s, total time: 383.81s
Epoch 16, Training Loss: 1.47218408007, Test accuracy: 0.979767628205, time: 18.57s, total time: 407.58s
Epoch 17, Training Loss: 1.4718299026, Test accuracy: 0.979767628205, time: 17.71s, total time: 431.23s
Epoch 18, Training Loss: 1.47149271065, Test accuracy: 0.978265224359, time: 17.27s, total time: 454.74s
Epoch 19, Training Loss: 1.4708190759, Test accuracy: 0.979467147436, time: 18.98s, total time: 481.63s
Total training time: 481.63s
Confusion Matrix:
Traceback (most recent call last):
  File "train.py", line 174, in <module>
    X, Y, train_op, loss_op, accuracy_op)
  File "train.py", line 93, in train
    Y: mnist.test.labels}))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,28,28,10]
	 [[Node: conv2/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape, conv2/kernel/read)]]

Caused by op u'conv2/convolution', defined at:
  File "train.py", line 132, in <module>
    qfns.convnet(tf.reshape(X, [-1, 28, 28, 1]), Y)
  File "/tmp_amd/adams/export/adams/2/z5004703/homework/submit/9444/assi01/hw1.py", line 161, in convnet
    name              ="conv2")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 509, in conv2d
    return layer.apply(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 303, in apply
    return self.__call__(inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 273, in __call__
    outputs = self.call(inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 156, in call
    data_format=utils.convert_data_format(self.data_format, self.rank + 2))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 639, in convolution
    op=op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 308, in with_space_to_batch
    return op(input, num_spatial_dims, padding)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 631, in op
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 129, in _non_atrous_convolution
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,28,28,10]
	 [[Node: conv2/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape, conv2/kernel/read)]]

A similar issue: https://stackoverflow.com/questions/39076388/tensorflow-deep-mnist-resource-exhausted-oom-when-allocating-tensor-with-shape

Where does the batch information show up? The tensor shape in the error message, [10000, 28, 28, 10], is the giveaway: training itself is batched, but the confusion matrix at the end of train() runs the entire 10,000-image test set through the conv net in a single sess.run call (feed_dict={X: mnist.test.images, Y: mnist.test.labels}).
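
One way out — a sketch only, reusing the names already defined in train(), and not something the assignment asks for — is to accumulate the confusion matrix over minibatches rather than in one giant run:

# sum per-batch confusion matrices instead of feeding all 10,000 test images at once
total_cm = 0
num_test_batches = mnist.test.num_examples // batch_size
for _ in range(num_test_batches):
    batch = mnist.test.next_batch(batch_size)
    total_cm += sess.run(cm, feed_dict={X: batch[0], Y: batch[1]})
print(total_cm)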

 

 

 

From: https://www.tensorflow.org/tutorials/layers

def cnn_model_fn(features, labels, mode):
"""Model function for CNN.""" # Input Layer 变为二维好理解的形式 input_layer = tf.reshape(features["x"], [-1, 28, 28, 1]) # Convolutional Layer #1 conv1 = tf.layers.conv2d( inputs=input_layer,  # [batch_sizeimage_width,image_height, color_channels] filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu) # Pooling Layer #1 pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)  # 关心两个参数 # Convolutional Layer #2 and Pooling Layer #2 conv2 = tf.layers.conv2d( inputs = pool1, filters = 64,  # (here, 32) kernel_size = [5, 5], padding = "same", activation = tf.nn.relu) pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2) # Dense Layer pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])  # 展开后才能全连接 dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu) dropout = tf.layers.dropout(   # 相当于对dense layer添加了dropout属性 inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)
######################################
# dense layer --> logits --> softmax
# 为何这么搞,感觉没有必要的样子
######################################
# Logits Layer logits = tf.layers.dense(inputs=dropout, units=10)  # 1024 -> 10 全连接 predictions = { # Generate predictions (for PREDICT and EVAL mode) "classes": tf.argmax(input=logits, axis=1), # Add `softmax_tensor` to the graph. It is used for PREDICT and by the # `logging_hook`. "probabilities": tf.nn.softmax(logits, name="softmax_tensor") }
if mode == tf.estimator.ModeKeys.PREDICT:  # 如果是prediction,那就重新设置下属性为predict return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
# Calculate Loss (for both TRAIN and EVAL modes) 【为何把label in loss搞的这么复杂?】 onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10) loss = tf.losses.softmax_cross_entropy( onehot_labels=onehot_labels, logits=logits)
# Configure the Training Op (for TRAIN mode) if mode == tf.estimator.ModeKeys.TRAIN: optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001) train_op = optimizer.minimize(  # 总是成对出现 loss=loss, global_step=tf.train.get_global_step())  # <-- 做什么用的呢? return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op) # Add evaluation metrics (for EVAL mode) eval_metric_ops = { "accuracy": tf.metrics.accuracy( labels=labels, predictions=predictions["classes"])} return tf.estimator.EstimatorSpec( mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

 

 


Appendix

 

A good summary: http://www.jianshu.com/p/e112012a4b2d

Below, a summary of the key and difficult points from that article.

 

Building the graph with with tf.name_scope

[5. Visualisation: TensorBoard]

TensorFlow ships with TensorBoard, which can automatically display the graph of the neural network we build.


 

The idea is to use with tf.name_scope to define each block; note the differences pointed out in the code comments:

 

import tensorflow as tf

def add_layer(inputs, in_size, out_size, activation_function=None):
    # add one more layer and return the output of this layer
    # difference: a big scope defining the layer, with small components inside
    with tf.name_scope('layer'):
        # difference: small components
        with tf.name_scope('weights'):
            Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')
        with tf.name_scope('biases'):
            biases = tf.Variable(tf.zeros([1, out_size]) + 0.1, name='b')
        with tf.name_scope('Wx_plus_b'):
            Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs


# define placeholder for inputs to network
# difference: a big scope containing the inputs x, y
with tf.name_scope('inputs'):  # <----
    xs = tf.placeholder(tf.float32, [None, 1], name='x_input')
    ys = tf.placeholder(tf.float32, [None, 1], name='y_input')

# add hidden layer
hidden_layer = add_layer(xs, 1, 10, activation_function=tf.nn.relu)    # <-- defines its own tf.name_scope internally
# add output layer
prediction   = add_layer(hidden_layer, 10, 1, activation_function=None) # <-- defines its own tf.name_scope internally

# the error between prediciton and real data
# difference: define the loss scope
with tf.name_scope('loss'):   # <----
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
                                        reduction_indices=[1]))

# difference: define the train scope
with tf.name_scope('train'):  # <----
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)


sess = tf.Session()
# difference: sess.graph dumps all the scopes into a file inside the "logs/" folder.
# Afterwards, open a terminal, cd to the directory one level above that folder, and run
# `tensorboard --logdir='logs/'`; it returns a URL, which you open in a browser
# and view under the Graph tab.
writer = tf.train.SummaryWriter("logs/", sess.graph)  # important step
sess.run(tf.initialize_all_variables())

After running the code above, open a terminal, cd to the directory one level above the folder where the logs were written, and run the command tensorboard --logdir='logs/'. It returns a URL; open it in a browser and click the Graph tab to see the graph:

 
 

Visualisation: summary_op

 
TensorBoard can display all kinds of data plotted during training: scalars, images, audio, the computation graph, data distributions, histograms, and embedding vectors.
To display data with TensorBoard:
1. while the TensorFlow graph is executing, the various kinds of data must be summarised and recorded into log files;
2. TensorBoard then reads these log files, parses the data, and produces a web page visualising it, so all the summary data can be observed in a browser.
 
summary_op covers operations such as summary.scalar, summary.histogram and summary.image:
1. these operations output various summary protobufs,
2. which are finally written into an event file via the summary writer.
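
A minimal end-to-end sketch of that two-step pipeline (with a constant standing in for a real loss tensor):

import tensorflow as tf

loss = tf.constant(0.5)                 # stand-in for a real loss tensor
tf.summary.scalar('loss', loss)         # step 1: declare the data to summarise
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/', sess.graph)
    summary = sess.run(merged)          # run the summary ops
    writer.add_summary(summary, global_step=0)  # step 2: write the protobuf to the event file
    writer.close()

# then run: tensorboard --logdir=logs/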

 

Scalar data is summarised and recorded with tf.summary.scalar; its signature is:

tf.summary.scalar(tags, values, collections=None, name=None)  

 

tf.summary.histogram records a histogram of a variable var directly, outputting a summary protobuf containing the histogram; its signature is:

tf.summary.histogram(tag, values, collections=None, name=None)  

 

Outputs a protobuf containing images; the summarised images are tagged 'tag/image/0', 'tag/image/1', etc., e.g. input/image/0:

tf.summary.image(tag, tensor, max_images=3, collections=None, name=None)  

 

Merges summaries of the above kinds once more; exactly which ones are merged is specified by inputs:

tf.summary.merge(inputs, collections=None, name=None)

 

Merges all summaries in the default graph:

tf.summary.merge_all(key='summaries')  

 

The class that writes summary protobufs into event files: SummaryWriter is a class whose member functions add data to an event file: add_summary(), add_session_log(), add_event(), or add_graph()

tf.summary.FileWriter  

Note that the computation graph itself is written to the event file via add_graph.

Here's an example:

# coding=utf-8

import tensorflow as tf
"""
First load TensorFlow, and set the maximum number of training steps to 1000,
the learning rate to 0.001, and the dropout keep probability to 0.9.
Also set the MNIST download path data_dir and the log path log_dir for the summary data.
The log path log_dir is crucial: it stores all the summary data for TensorFlow to display.
"""

from tensorflow.examples.tutorials.mnist import input_data
max_step      = 1000
learning_rate = 0.001
dropout       = 0.9
data_dir = '/tmp/tensorflow/mnist/input_data'
log_dir  = '/tmp/tensorflow/mnist/logs/mnist_with_summaries'

# Download the MNIST data with input_data.read_data_sets and create TensorFlow's default Session
mnist = input_data.read_data_sets(data_dir, one_hot=True)
sess  = tf.InteractiveSession()  # see the note below

  # The only difference: tf.InteractiveSession() installs itself as the default session, so tensor.eval() and operation.run() rely on that default session.

  # In other words: InteractiveSession requires less code, because it allows tensors to be evaluated without explicitly passing a session.
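
  # For instance (a hypothetical two-liner):
  #     c = tf.constant(3)
  #     print(c.eval())        # works because the InteractiveSession is the default
  # with a plain tf.Session() you would write print(sess.run(c)) instead.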


""" 为了在TensorBoard中展示节点名称,设计网络时会常使用tf.name_scope限制命名空间, 在这个with下所有的节点都会自动命名为input/xxx这样的格式。 定义输入x和y的placeholder,并将输入的一维数据变形为28×28的图片存储到另一个tensor, # <---- 反卷积? 这样就可以使用tf.summary.image将图片数据汇总给TensorBoard展示了。 """ with tf.name_scope('input'): x = tf.placeholder(tf.float32, [None, 784], name='x_input') y = tf.placeholder(tf.float32, [None, 10], name='y_input') with tf.name_scope('input_reshape'): image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) tf.summary.image('input', image_shaped_input, 10) # 定义神经网络模型参数的初始化方法, # 权重依然使用常用的truncated_normal进行初始化,偏置则赋值为0.1 def weight_variable(shape): initial = tf.truncated_normal(shape, stddev=0.1)  # 从截断的正态分布中随机取值 return tf.Variable(initial) def bias_variable(shape): initial = tf.constant(0.1, shape=shape) return tf.Variable(initial) # 定义对Variable变量的数据汇总函数 """ 计算出Variable的mean,stddev,max和min, 对这些标量数据使用tf.summary.scalar进行记录和汇总。 同时,使用tf.summary.histogram直接记录变量var的直方图。 """ def variable_summaries(var): with tf.name_scope('summaries'): mean = tf.reduce_mean(var) tf.summary.scalar('mean', mean)
with tf.name_scope(
'stddev'): stddev = tf.sqrt(tf.reduce_mean(tf.square(var-mean))) tf.summary.scalar('stddev', stddev) tf.summary.scalar('max', tf.reduce_max(var)) tf.summary.scalar('min', tf.reduce_min(var)) tf.summary.histogram('histogram', var) # 设计一个MLP多层神经网络来训练数据,在每一层中都会对模型参数进行数据汇总。 """ 创建一个one-layer NN 并进行数据汇总的函数nn_layer。 这个函数的输入参数有:输入数据input_tensor, 输入的维度input_dim, 输出的维度output_dim层名称layer_name,激活函数act则默认使用Relu。 在函数内,显示初始化这层神经网络的权重偏置,并使用前面定义的variable_summaries对variable进行数据汇总。 然后对输入做矩阵乘法并加上偏置,再将未进行激活的结果使用tf.summary.histogram统计直方图。 同时,在使用激活函数后,再使用tf.summary.histogram统计一次。 """ def nn_layer(input_tensor, input_dim, output_dim, layer_name,act=tf.nn.relu): with tf.name_scope(layer_name): with tf.name_scope('weight'): weights = weight_variable([input_dim, output_dim]) variable_summaries(weights) with tf.name_scope('biases'): biases = bias_variable([output_dim]) variable_summaries(biases) with tf.name_scope('Wx_plus_b'): preactivate = tf.matmul(input_tensor, weights) + biases tf.summary.histogram('pre_activations', preactivate) activations = act(preactivate, name='actvations') tf.summary.histogram('activations', activations) return activations """ 使用刚定义好的nn_layer创建一层神经网络,输入维度是图片的尺寸(784=24×24),输出的维度是隐藏节点数500. 再创建一个Droput层,并使用tf.summary.scalar记录keep_prob。然后再使用nn_layer定义神经网络的输出层,激活函数为全等映射,此层暂时不使用softmax,在后面会处理。 """ hidden1 = nn_layer(x, 784, 500, 'layer1') with tf.name_scope('dropout'): keep_prob = tf.placeholder(tf.float32) tf.summary.scalar('dropout_keep_probability', keep_prob) dropped = tf.nn.dropout(hidden1, keep_prob) y1 = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) """ 这里使用tf.nn.softmax_cross_entropy_with_logits()对前面输出层的结果进行softmax处理并计算交叉熵损失cross_entropy。 计算平均损失,并使用tf.summary.saclar进行统计汇总。 """ with tf.name_scope('cross_entropy'): diff = tf.nn.softmax_cross_entropy_with_logits(logits=y1, labels=y) with tf.name_scope('total'): cross_entropy = tf.reduce_mean(diff) tf.summary.scalar('cross_entropy', cross_entropy) """ 使用Adma优化器对损失进行优化,同时统计预测正确的样本数并计算正确率accuray, 再使用tf.summary.scalar对accuracy进行统计汇总。 """ with tf.name_scope('train'): train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy) with tf.name_scope('accuracy'): with tf.name_scope('correct_prediction'): correct_prediction = tf.equal(tf.argmax(y1, 1), tf.arg_max(y, 1)) with tf.name_scope('accuracy'): accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) tf.summary.scalar('accuracy', accuracy) """ 由于之前定义了非常多的tf.summary的汇总操作,一一执行这些操作态麻烦, 所以这里使用tf.summary.merger_all()直接获取所有汇总操作,以便后面执行。 然后,定义两个tf.summary.FileWrite(文件记录器)在不同的子目录,分别用来存放训练和测试的日志数据。 同时,将Session的计算图sess.graph加入训练过程的记录器,这样在TensorBoard的GRAPHS窗口中就能展示整个计算图的可视化效果。 最后使用tf.global_variables_initializer().run()初始化全部变量。 """ merged = tf.summary.merge_all() train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph) test_writer = tf.summary.FileWriter(log_dir + '/test') tf.global_variables_initializer().run() """ 定义feed_dict的损失函数。 该函数先判断训练标记,如果训练标记为true,则从mnist.train中获取一个batch的样本,并设置dropout值; 如果训练标记为False,则获取测试数据,并设置keep_prob为1,即等于没有dropout效果。 """ def feed_dict(train): if train: xs, ys = mnist.train.next_batch(100) k = dropout else: xs, ys = mnist.test.images, mnist.test.labels k = 1.0 return {x: xs, y: ys, keep_prob: k} # 实际执行具体的训练,测试及日志记录的操作 """ 首先,使用tf.train.Saver()创建模型的保存器。 然后,进入训练的循环中,每隔10步执行一次merged(数据汇总),accuracy(求测试集上的预测准确率)操作, 并使应test_write.add_summary将汇总结果summary和循环步数i写入日志文件; 同时每隔100步,使用tf.RunOption定义Tensorflow运行选项,其中设置trace_level为FULL——TRACE, 并使用tf.RunMetadata()定义Tensorflow运行的元信息, 这样可以记录训练是运算时间和内存占用等方面的信息. 
再执行merged数据汇总操作和train_step训练操作,将汇总summary和训练元信息run_metadata添加到train_writer. 平时,则执行merged操作和train_step操作,并添加summary到trian_writer。 所有训练全部结束后,关闭train_writer和test_writer。 """ saver = tf.train.Saver() for i in range(max_step): if i % 10 == 0: summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) test_writer.add_summary(summary, i) print('Accuracy at step %s: %s' % (i, acc)) else: if i % 100 == 99: run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) run_metadata = tf.RunMetadata() summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True), options=run_options, run_metadata=run_metadata) train_writer.add_run_metadata(run_metadata, 'step%03d' % i) train_writer.add_summary(summary, i) saver.save(sess, log_dir+"/model.ckpt", i) print('Adding run metadata for', i) else: summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) train_writer.add_summary(summary, i) train_writer.close() test_writer.close()

 
