Theano Tutorial 1 (Translation)

A study guide to Theano, translated mainly from the official documentation.

Getting Started

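This guide is not a machine learning course, but we begin with a brief review of the relevant concepts to make sure we are all starting from the same place. You will also need to download a few datasets in order to run the programs in this guide.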

Download

Each algorithm section requires you to download the corresponding files. To get all of the files at once, clone the repository:

git clone git://github.com/lisa-lab/DeepLearningTutorials.git

Datasets

MNIST dataset (mnist.pkl.gz)

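The MNIST dataset consists of images of handwritten digits, divided into 60,000 training examples and 10,000 test examples. In much of the literature, and in this guide, the official training set is further split into 50,000 training examples and 10,000 validation examples used for model selection. All images have been normalized to a size of 28*28 pixels. In the original data, each pixel is stored as an ordinary grayscale value (in the range 0 to 255).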

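To make the dataset easy to use from Python, we pickled it. The pickled file contains three sets: the training set, the validation set and the test set. Each set is a pair made up of the images and their corresponding labels. An image is a 784-dimensional (28*28) numpy array, and a label is a digit between 0 and 9. The code below shows how to load the dataset.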

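import pickle, gzip, numpy

# Load the dataset
# (encoding='latin1' is needed on Python 3 because the file was pickled with Python 2)
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')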
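To make the layout concrete without needing the file, the sketch below builds a tiny synthetic stand-in with the same structure as one of the three sets (a pair of an image array and a label array) and reshapes one 784-dimensional row back into a 28*28 image. The names `fake_images`, `fake_labels`, `fake_set` and `n_examples`, and the random values themselves, are assumptions made for illustration; they are not MNIST data.

```python
import numpy

# Synthetic stand-in for one split (NOT real MNIST data): a pair of
# (images, labels), mirroring how a set is unpacked in this guide.
n_examples = 5
rng = numpy.random.default_rng(0)
fake_images = rng.random((n_examples, 28 * 28))      # one 784-dim row per image
fake_labels = rng.integers(0, 10, size=n_examples)   # one digit 0-9 per image
fake_set = (fake_images, fake_labels)

data_x, data_y = fake_set
first_image = data_x[0].reshape(28, 28)   # back to a 2-D image for display

print(data_x.shape, data_y.shape, first_image.shape)
```

With the real file, `train_set` would take the place of `fake_set`.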

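When using this dataset, we usually divide it into minibatches. We encourage you to store the dataset in shared variables and to access it by minibatch index. The reason is tied to running on the GPU: copying data into GPU memory carries a large overhead. If you copy the data on request, one minibatch at a time, as the code does when shared variables are not used, the GPU code will be no faster than the CPU code. If you keep the data in Theano shared variables, Theano can copy all of it to the GPU in a single call. (Translator's note: some of the explanation is left untranslated, because I do not fully understand how GPUs work.)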

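At this point the data lives in a single variable, and a minibatch is a slice of that variable. The most natural way to define a minibatch is by the position and size of the slice. In our setting the batch size is fixed, so a function needs only the position of the slice to access each minibatch. The code below shows how to store the data and how to access a minibatch.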

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense) therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set
data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
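The same index-based access can be tried out in plain NumPy, without Theano or a GPU. This is only a sketch: the helper `get_minibatch` and the toy array `toy_x` are hypothetical names introduced here, and complete batches only are counted.

```python
import numpy

def get_minibatch(data, index, batch_size):
    """Return the index-th slice of `data` for a fixed batch size."""
    return data[index * batch_size:(index + 1) * batch_size]

toy_x = numpy.arange(20).reshape(10, 2)   # 10 toy examples, 2 features each
batch_size = 4
n_batches = toy_x.shape[0] // batch_size  # complete batches only: 10 // 4

second_batch = get_minibatch(toy_x, 1, batch_size)   # rows 4, 5, 6, 7
print(n_batches, second_batch[:, 0])
```

Because the batch size is fixed, the minibatch index alone identifies the slice.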

Notation

Dataset notation

We write $\mathbf{D}$ for a dataset. When the distinction matters, the training, validation and test sets are written $\mathbf{D_{train}}$, $\mathbf{D_{valid}}$ and $\mathbf{D_{test}}$.

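This guide focuses on classification, so every dataset is made up of pairs $(x^{(i)},y^{(i)})$, where $x^{(i)} \in \mathbb{R}^D$ is a feature vector and $y^{(i)} \in \{0, \ldots, L\}$ is the class of the example $x^{(i)}$.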

Unless stated otherwise, the remaining symbols follow these conventions:

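$\mathbf{W}$ upper-case symbols denote matrices
$W_{ij}$ the element in row i, column j of the matrix
$W_{i \cdot}$ row i of the matrix, as a vector
$W_{\cdot j}$ column j of the matrix, as a vector
$b$ lower-case symbols denote vectors
$b_i$ the i-th element of the vector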

The symbols and functions used are:

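$D$ dimension of the input vectors
$D_h^{(i)}$ number of hidden units in layer i
$f_\theta(x), f(x)$ classification function
$L$ number of labels
$L(\theta,D)$ log-likelihood of the model
$l(\theta,D)$ empirical loss of the prediction function
NLL negative log-likelihood
$\theta$ the set of all model parameters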

Python namespaces

The programs in this guide generally use the following imports:

import theano
import theano.tensor as T
import numpy

A Primer on Supervised Optimization

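In deep learning, unsupervised learning of deep networks is widely used, but supervised learning still plays an important role. This section briefly reviews supervised learning for classification and introduces an implementation of stochastic gradient descent in Theano.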

Learning a classifier

Zero-one loss

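The methods presented in this guide are also commonly used for classification in general. The goal of training a classifier is to minimize the number of errors the prediction function makes on test examples. The simplest measure of this error is the zero-one loss. If the prediction function is $f: \mathbb{R}^D \rightarrow \{0, \ldots, L\}$, the loss can be written as: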

$$l_{0,1}=\sum_{i=0}^{|D|} I_{f(x^{(i)}) \neq y^{(i)}}$$

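Here $D$ is either the training set (during training) or a set that shares no examples with the training set, so that validation and test estimates are not biased. The indicator function $I$ is defined as: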

$$I_x = \left\{\begin{array}{ll} 1 & \mbox{if $x$ is True} \\ 0 & \mbox{otherwise}\end{array}\right.$$

In this guide, the prediction function is defined as:

$$f(x) = \mathrm{argmax}_k \, P(Y=k|x,\theta)$$

In Python, using Theano, this can be implemented as follows:

# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
# p_y_given_x has one row per example, so the argmax is taken over classes (axis=1)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
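For a concrete feel of the numbers, here is the same computation as a NumPy sketch on made-up values. It assumes `p_y_given_x` is a matrix with one row per example and one column per class; the probabilities and labels below are invented for illustration.

```python
import numpy

# Made-up class probabilities for 4 examples and 3 classes (rows sum to 1).
p_y_given_x = numpy.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.3, 0.3, 0.4],
                           [0.6, 0.3, 0.1]])
y = numpy.array([0, 1, 1, 2])                 # made-up true labels

y_pred = numpy.argmax(p_y_given_x, axis=1)    # f(x) = argmax_k P(Y=k|x)
zero_one_loss = int(numpy.sum(y_pred != y))   # number of misclassified examples

print(y_pred, zero_one_loss)
```

The first two examples are classified correctly and the last two are not, so the loss counts two errors.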

Negative log-likelihood loss

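Because the zero-one loss is not differentiable, optimizing it directly becomes very hard for a complex model with thousands or even tens of thousands of parameters. We therefore maximize the log-likelihood of the classifier instead: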

$$L(\theta,D) = \sum_{i=0}^{|D|} \log P(Y=y^{(i)}|x^{(i)},\theta)$$

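The likelihood of the correct class is not the same thing as the number of correct predictions, but from the point of view of a randomly initialized classifier the two are quite similar. Keep in mind that the likelihood and the zero-one loss are different objectives: on the validation set you should generally see them move together, but at times one will improve while the other gets worse. (Translator's note: I do not fully understand this passage.)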
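A small numeric sketch may help with the difference between the two objectives. It assumes a two-class problem in which a prediction counts as correct when the true class receives probability above 0.5; the three classifiers and all of their probabilities are made up for illustration.

```python
import math

# Probability each classifier assigns to the CORRECT class of 3 examples
# (made-up numbers for a two-class problem).
p_correct_a = [0.51, 0.51, 0.49]   # barely right twice, barely wrong once
p_correct_b = [0.99, 0.99, 0.01]   # confidently right twice, confidently wrong once
p_correct_c = [0.49, 0.49, 0.49]   # barely wrong every time

def zero_one(p_correct):
    """Count the errors: the correct class did not get more than 0.5."""
    return sum(1 for p in p_correct if p <= 0.5)

def log_likelihood(p_correct):
    """Sum of the log-probabilities of the correct classes."""
    return sum(math.log(p) for p in p_correct)

for name, p in [("a", p_correct_a), ("b", p_correct_b), ("c", p_correct_c)]:
    print(name, zero_one(p), log_likelihood(p))
```

Classifiers a and b make the same number of errors but have different log-likelihoods, and c makes more errors than b while still having the higher log-likelihood, because b is heavily penalized for its one confident mistake.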

Since we usually speak of minimizing a loss function, learning amounts to minimizing the negative log-likelihood:

$$NLL(\theta,D) = -\sum_{i=0}^{|D|} \log P(Y=y^{(i)}|x^{(i)},\theta)$$

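The NLL is a differentiable surrogate for the zero-one loss, so we can train the classifier using its gradient over the training set. The corresponding code is: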

# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
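The indexing trick works the same way in NumPy, which makes it easy to check by hand. The sketch below uses made-up probabilities and labels and picks, for each row, the entry in the column of the true label.

```python
import numpy

# Made-up class probabilities for 3 examples and 3 classes, with true labels.
p_y_given_x = numpy.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.3, 0.3, 0.4]])
y = numpy.array([0, 1, 1])

# Pick entry [i, y[i]] from each row: the log-probability of the true label.
log_p_correct = numpy.log(p_y_given_x)[numpy.arange(y.shape[0]), y]
nll = -numpy.sum(log_p_correct)

print(log_p_correct, nll)
```

The selected probabilities are 0.7, 0.8 and 0.3, so the NLL equals minus the log of their product.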

Stochastic gradient descent

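What is ordinary gradient descent? Given a loss function, the method repeatedly moves the parameters a small step downhill on the error surface in order to reach an optimum. When the loss is computed on the training data, gradient descent drives the training loss toward a (local) minimum. In pseudocode: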

# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
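The pseudocode can be made runnable on a toy problem. This is a minimal sketch, assuming a one-dimensional quadratic loss whose gradient is written by hand; the functions `loss`, `grad` and `gradient_descent`, the tolerance and the step limit are all illustrative choices, not part of the tutorial.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the toy loss

def gradient_descent(w0, learning_rate=0.1, tol=1e-8, max_steps=10000):
    w = w0
    for _ in range(max_steps):
        g = grad(w)
        if abs(g) < tol:           # stopping condition: gradient is nearly zero
            break
        w -= learning_rate * g     # small step downhill
    return w

w_star = gradient_descent(w0=0.0)
print(w_star, loss(w_star))
```

Each step shrinks the distance to the minimum by a constant factor, so the loop stops well before the step limit.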

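Stochastic gradient descent (SGD) follows the same principle, but estimates the gradient from only a small part of the training data at a time (in its purest form, a single example), so it proceeds more quickly. In pseudocode: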

# STOCHASTIC GRADIENT DESCENT
for (x_i,y_i) in training_set:
                            # imagine an infinite generator
                            # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params

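The variant of SGD used in deep learning works on minibatches and differs slightly: in minibatch SGD, each gradient estimate uses more than one training example. This reduces the variance of the gradient estimate and also makes good use of the memory hierarchy in modern computers.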

for (x_batch,y_batch) in train_batches:
                            # imagine an infinite generator
                            # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
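The claim that minibatches reduce the variance of the gradient estimate can be checked empirically. The sketch below assumes a synthetic one-dimensional regression problem with a squared-error loss and a hand-derived gradient; the data, the seed and the helper names `minibatch_gradient` and `estimate_variance` are made up for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)

# Synthetic 1-D regression data, y = 2x + noise (illustration only).
x = rng.normal(size=10000)
y = 2.0 * x + rng.normal(scale=0.5, size=10000)
w = 0.0   # parameter value at which the gradient is estimated

def minibatch_gradient(batch_size):
    """Gradient of the mean squared error w.r.t. w on one random minibatch."""
    idx = rng.integers(0, x.shape[0], size=batch_size)
    return numpy.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])

def estimate_variance(batch_size, n_trials=2000):
    """Empirical variance of the gradient estimate over many random minibatches."""
    return numpy.var([minibatch_gradient(batch_size) for _ in range(n_trials)])

var_1 = estimate_variance(1)       # one example per estimate (plain SGD)
var_100 = estimate_variance(100)   # 100 examples per estimate
print(var_1, var_100)
```

Averaging over more examples gives a much less noisy estimate of the same underlying gradient.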

The pseudocode above describes how the algorithm works. A concrete implementation in Theano looks like this:

# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
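For readers who want to run the loop without Theano, the same structure can be sketched in NumPy, with a hand-derived gradient standing in for `T.grad`. Everything here is an assumption made for illustration: the synthetic linear-regression data, the squared-error loss, the learning rate and the fixed number of epochs used as the stopping condition.

```python
import numpy

rng = numpy.random.default_rng(0)

# Synthetic linear-regression data with known weights (illustration only).
true_w = numpy.array([1.5, -2.0])
X = rng.normal(size=(1000, 2))
Y = X @ true_w + rng.normal(scale=0.1, size=1000)

params = numpy.zeros(2)
learning_rate = 0.1
batch_size = 50
n_batches = X.shape[0] // batch_size

for epoch in range(20):                 # stopping condition: fixed epoch count
    for i in range(n_batches):
        x_batch = X[i * batch_size:(i + 1) * batch_size]
        y_batch = Y[i * batch_size:(i + 1) * batch_size]
        residual = x_batch @ params - y_batch
        loss = numpy.mean(residual ** 2)
        # Hand-derived gradient of the mean squared error w.r.t. params.
        d_loss_wrt_params = 2.0 * x_batch.T @ residual / batch_size
        params -= learning_rate * d_loss_wrt_params

print(params, loss)
```

In the Theano version, the gradient computation and the parameter update are compiled into the single function `MSGD`.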

Regularization

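There is more to machine learning than optimization. The point of training a model on some data is to apply it to new data, but the training algorithms above do not take this into account, which can lead to overfitting. One way to combat overfitting is regularization. Several techniques exist; here we cover L1/L2 regularization and early stopping.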

L1 and L2 regularization

This technique adds an extra term to the loss function that penalizes certain parameter configurations. Suppose our loss function is:

$$NLL(\theta,D) = -\sum_{i=0}^{|D|} \log P(Y=y^{(i)}|x^{(i)},\theta)$$

Then the regularized loss can be defined as:

$$E(\theta,D) = NLL(\theta,D) + \lambda R(\theta)$$

In our case, it takes the specific form:

$$E(\theta,D) = NLL(\theta,D) + \lambda ||\theta||_p^p$$

where

$$||\theta||_p = \left(\sum_{j=0}^{|\theta|}{|\theta_j|^p}\right)^{\frac{1}{p}}$$

is the $L_p$ norm of the parameters $\theta$. The value of p is usually 1 or 2. When p=2, the regularizer is also known as weight decay.

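Note that this simple approach does not by itself guarantee that the model will generalize. In practice, however, applying it to neural networks has been found to help generalization, especially on small datasets. The code below shows how to apply it.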

# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
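A quick numeric check of the two penalty terms, as a NumPy sketch. The parameter vector, the value standing in for the data term and the two regularization strengths are made-up numbers chosen so that the arithmetic is easy to follow.

```python
import numpy

param = numpy.array([3.0, -4.0])     # made-up parameter vector
nll = 1.25                           # made-up value standing in for the NLL
lambda_1, lambda_2 = 0.01, 0.001     # made-up regularization strengths

L1 = numpy.sum(numpy.abs(param))     # |3| + |-4| = 7
L2_sqr = numpy.sum(param ** 2)       # 3^2 + (-4)^2 = 25

loss = nll + lambda_1 * L1 + lambda_2 * L2_sqr
print(L1, L2_sqr, loss)
```

Both penalties grow as the parameters move away from zero, so minimizing the regularized loss pulls the parameters toward smaller values.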

Early stopping

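Early stopping is another way to combat overfitting: it monitors the model's performance on a validation set. Because the validation examples are not used for training, they can stand in for test data while training is in progress. If the model's performance on the validation set stops improving much, or even gets worse, further optimization should be abandoned.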

There are many ways to decide when to stop. In this guide we use a strategy based on a geometrically increasing amount of patience.

import copy
import time

# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2     # wait this much longer when a new best is
                              # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience // 2)
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.perf_counter()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:

                    patience = max(patience, iter * patience_increase)
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
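To see the patience rule in action without training a model, the sketch below feeds a scripted sequence of validation losses through the same logic. It is a simplification under stated assumptions: validation happens at every iteration, the initial patience is very small, and the function `early_stopping` and the loss values are made up for illustration.

```python
def early_stopping(validation_losses, patience=4, patience_increase=2,
                   improvement_threshold=0.995):
    """Scan one validation loss per iteration; stop when patience runs out.

    Returns (best_iter, best_loss, last_iter_run).
    """
    best_loss = float("inf")
    best_iter = -1
    it = -1
    for it, this_loss in enumerate(validation_losses):
        if this_loss < best_loss:
            # Extend patience only when the improvement is significant.
            if this_loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss = this_loss
            best_iter = it
        if patience <= it:
            break
    return best_iter, best_loss, it

# Made-up validation curve: improves for a while, then gets steadily worse.
losses = [0.50, 0.40, 0.30, 0.25, 0.26, 0.27, 0.28, 0.29, 0.30, 0.31]
print(early_stopping(losses))
```

The best loss is found at iteration 3, which raises the patience to 6; since no later iteration improves on it, the loop stops at iteration 6 instead of running through all ten values.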

posted on 2013-04-03 11:07 xueliangliu
