deeplearning -- Learning a Simple Classifier
Zero-One Loss
Our goal is to make the number of misclassifications (the zero-one loss) as small as possible:
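$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}, \qquad f(x) = \operatorname{argmax}_k P(Y = k \mid x, \theta)$$

where $I$ is the indicator function.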
f(x) returns the class with the highest probability for the input under the current theta. In other words, we predict f(x) from x; if that value equals y, the prediction is correct, otherwise it is an error.
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
Note: T.neq ("not equal") plays the role of the indicator function I here: T.neq(x, y) returns 1 where the two values differ and 0 where they match.
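As a minimal runnable sketch of how this expression gets compiled and evaluated (the softmax model, the weight shapes, and all values below are illustrative assumptions, not part of the original tutorial):

import numpy
import theano
import theano.tensor as T

x = T.matrix('x')    # a minibatch of inputs, one row per example
y = T.ivector('y')   # the corresponding integer labels

# a toy softmax model with 2 input features and 3 classes (hypothetical)
W = theano.shared(numpy.zeros((2, 3), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros(3, dtype=theano.config.floatX), name='b')
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

# argmax over the class axis gives the predicted label for each example
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))

count_errors = theano.function([x, y], zero_one_loss)
inputs = numpy.random.randn(5, 2).astype(theano.config.floatX)
labels = numpy.array([0, 1, 2, 0, 1], dtype='int32')
print(count_errors(inputs, labels))  # number of misclassified examples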
Negative Log-Likelihood Loss
Since the zero-one loss is not differentiable, optimizing it directly in a large model is prohibitively expensive. Instead, we maximize the log-likelihood of the classifier (the likelihood is the probability of the observed labels under the model):
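$$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$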
Maximizing this is the same as minimizing the negative log-likelihood loss.
The negative log-likelihood (NLL):
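$$NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta) = -\mathcal{L}(\theta, \mathcal{D})$$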
# NLL is a symbolic variable; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
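The same fancy-indexing trick works in plain NumPy, which makes it easy to check by hand (the probability values below are made up for illustration):

import numpy

# rows: 3 examples; columns: predicted probability of each of 3 classes
p = numpy.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
y = numpy.array([0, 1, 2])

# picks p[0, 0], p[1, 1], p[2, 2]: the probability assigned to each correct label
correct_log_probs = numpy.log(p)[numpy.arange(y.shape[0]), y]
print(correct_log_probs)
print(-numpy.sum(correct_log_probs))  # the NLL of this small "minibatch"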
Stochastic Gradient Descent (SGD)
# GRADIENT DESCENT
while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
The above is ordinary gradient descent over the full training set; the basic loop is: loss → gradient → parameter update.
Stochastic gradient descent instead estimates the gradient from only a few examples at a time. The simplest form uses a single example per update:
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
Minibatch SGD works exactly like SGD, except that it uses several examples per update:
# MINIBATCH STOCHASTIC GRADIENT DESCENT
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
The snippets above are pseudocode; in Theano the actual training loop looks like this:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
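To make that loop concrete, here is a self-contained sketch that trains a small softmax classifier with MSGD; the toy data, the model shape, and the hyperparameter values are all illustrative assumptions, not from the original tutorial:

import numpy
import theano
import theano.tensor as T

rng = numpy.random.RandomState(0)

# toy data: 100 two-dimensional points with 3 (random) class labels
data_x = rng.randn(100, 2).astype(theano.config.floatX)
data_y = rng.randint(0, 3, size=100).astype('int32')

x_batch = T.matrix('x_batch')
y_batch = T.ivector('y_batch')
W = theano.shared(numpy.zeros((2, 3), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros(3, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x_batch, W) + b)
loss = -T.mean(T.log(p_y_given_x)[T.arange(y_batch.shape[0]), y_batch])

learning_rate = 0.1
# one (variable, new_value) update pair per shared parameter
g_W, g_b = T.grad(loss, [W, b])
updates = [(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

batch_size = 20
for epoch in range(10):
    for i in range(0, 100, batch_size):
        batch_loss = MSGD(data_x[i:i + batch_size], data_y[i:i + batch_size])
    print('epoch %d, last minibatch loss %f' % (epoch, batch_loss))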
Regularization
We want the model to work well on data beyond the training set. To guard against overfitting (which often shows up as abnormally large parameters), we regularize. This section covers L1/L2 regularization and early stopping.
For our problem, the regularized loss can be defined concretely as:
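$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p$$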
where
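$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{\frac{1}{p}}$$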
Observe that for p = 1 this norm is the sum of absolute values, and for p = 2 it is the square root of the sum of squares:
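$$\|\theta\|_1 = \sum_{j} |\theta_j|, \qquad \|\theta\|_2 = \sqrt{\sum_{j} \theta_j^2}$$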
In Theano, for a single parameter tensor param:

# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the regularized loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
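In a model with several parameter tensors, the penalty is typically summed over all of them. A small sketch (the parameter values and lambda settings below are arbitrary illustrations):

import numpy
import theano
import theano.tensor as T

W = theano.shared(numpy.array([[1.0, -2.0], [0.5, 0.0]],
                              dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.array([0.1, -0.1], dtype=theano.config.floatX), name='b')
params = [W, b]

# accumulate the penalty over every parameter tensor
L1 = sum(T.sum(abs(p)) for p in params)
L2_sqr = sum(T.sum(p ** 2) for p in params)

print(L1.eval())      # 1 + 2 + 0.5 + 0 + 0.1 + 0.1 = 3.7
print(L2_sqr.eval())  # 1 + 4 + 0.25 + 0 + 0.01 + 0.01 = 5.27

# these terms would then be added to the NLL:
# loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr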
Early Stopping

Early stopping fights overfitting by monitoring the model's performance on a validation set and halting training once the validation score stops improving:
# early-stopping parameters
patience = 5000                # look at this many minibatches regardless
patience_increase = 2          # wait this much longer when a new best is
                               # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience / 2)
                               # go through this many
                               # minibatches before checking the network
                               # on the validation set; in this case we
                               # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during
# the optimization
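The loop above follows the original tutorial's Python 2 idioms; under Python 3, xrange becomes range, time.clock() (removed in Python 3.8) becomes time.perf_counter(), and patience / 2 is better written patience // 2 so that validation_frequency stays an integer.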