Exercise : Stacked Autoencoder

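Reference: http://deeplearning.stanford.edu/wiki/index.php/Exercise:_Implement_deep_networks_for_digit_classification

Experiment overview: handwritten digit recognition on the MNIST database. The digits range from 0 to 9, with 60,000 training samples and 10,000 test samples. Each sample is a 28*28 image.

Environment: MATLAB 2010a

Method:

A multi-layer neural network is used, here with two hidden layers. The hidden layers are learned separately and trained one after another, with the output of one hidden layer serving as the input of the next. Each hidden layer is trained as a three-layer autoencoder, which is unsupervised learning.

The last layer is a softmax regression model, trained as a 10-class classifier. Back-propagation is then used to adjust the parameters of the whole network (this chapter calls it the fine-tuning strategy).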

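Theoretical background:

Advantages of deep neural networks:

1. They can reduce the number of neurons exponentially. For example, computing XOR (the parity of n inputs) with a single hidden layer or an otherwise shallow network requires exponentially many neurons, whereas a deep network needs only polynomially many.

2. Feature learning proceeds by part-whole decomposition. The first layer may learn edge features, the second layer may join edges to detect longer contours, and higher layers may detect parts of objects, such as a person's eyes or nose.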

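Difficulties in training deep neural networks:

1. availability of data: training a neural network is supervised learning and needs a large number of labeled samples. As depth increases, the number of labeled samples needed grows rapidly, otherwise the model overfits. Labeled samples, however, are hard to obtain in large quantities.
2. local optima: supervised training gives good parameters for a shallow network. Training a deep network means solving a highly non-convex problem, and the non-convexity becomes severe as depth increases, so gradient descent and similar methods do not work well.
3. diffusion of gradients: deep networks usually compute gradients with back propagation, passing the output-layer error back through the hidden layers one by one. As depth increases, the gradient shrinks quickly, so by the time it reaches the earliest hidden layers it is already very small. The weight updates there are also very small, and those layers then contribute very little to improving the performance of the whole model.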
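The diffusion of gradients in item 3 can be observed on a toy network. The following is a minimal sketch, not part of the exercise: the layer width, depth, weight scale and the random error at the top layer are all arbitrary assumptions chosen for illustration.

```matlab
% Sketch: how the back-propagated error shrinks in a deep sigmoid network.
% All sizes and the weight scale below are arbitrary choices for illustration.
randn('state', 0);            % legacy seeding syntax, accepted by MATLAB and Octave
n = 50;                       % units per layer
depth = 8;                    % number of weight layers
sigmoid = @(x) 1 ./ (1 + exp(-x));

% Random weights and a forward pass on one random input
W = cell(depth, 1);
a = cell(depth + 1, 1);
a{1} = randn(n, 1);
for l = 1:depth
  W{l} = 0.1 * randn(n, n);
  a{l+1} = sigmoid(W{l} * a{l});
end

% Backward pass: start from an arbitrary error at the top layer
delta = cell(depth + 1, 1);
delta{depth+1} = randn(n, 1) .* a{depth+1} .* (1 - a{depth+1});
for l = depth:-1:2
  delta{l} = (W{l}' * delta{l+1}) .* a{l} .* (1 - a{l});
end

% Norm of the error term at each layer, from the first hidden layer to the top
deltaNorm = zeros(depth, 1);
for l = 2:depth+1
  deltaNorm(l-1) = norm(delta{l});
end
disp(deltaNorm');
```

Each step multiplies the error by the sigmoid derivative a .* (1 - a), which is at most 0.25, so the norms printed for the early layers are much smaller than the one at the top.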

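greedy layer-wise learning:

This model also has several hidden layers, but each hidden layer is learned separately: the first layer is learned first, then its output is used as the input for learning the second layer, and so on. Each layer is learned as a simple three-layer autoencoder, and a softmax regression layer is added at the end as the classifier.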
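The data flow of this layer-wise scheme can be sketched as follows. Random matrices stand in for trained autoencoder weights here (an assumption made only so the snippet runs on its own); the sizes match the exercise settings used later in this post, and the number of examples m is arbitrary.

```matlab
% Sketch of the data flow in greedy layer-wise training.
% Random matrices stand in for trained autoencoder weights (assumption).
sigmoid = @(x) 1 ./ (1 + exp(-x));
inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200; m = 5;

X  = rand(inputSize, m);                        % m unlabeled examples as columns
W1 = 0.01 * randn(hiddenSizeL1, inputSize);     b1 = zeros(hiddenSizeL1, 1);
W2 = 0.01 * randn(hiddenSizeL2, hiddenSizeL1);  b2 = zeros(hiddenSizeL2, 1);

% Layer 1 would be trained on X; its hidden activations become the new "data"
F1 = sigmoid(bsxfun(@plus, W1 * X, b1));
% Layer 2 would be trained on F1; its activations feed the softmax classifier
F2 = sigmoid(bsxfun(@plus, W2 * F1, b2));

disp(size(F1));
disp(size(F2));
```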

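Advantages of this model:

1. availability of data: an autoencoder is fully unsupervised, so it only needs unlabeled samples, which are much easier to obtain than labeled ones. This is also the biggest advantage of self-taught learning.
2. better local optima: training on unlabeled samples gives a better initial value W for the weights than random initialization. Fine-tuning then adjusts the weights starting from W, and is likely to reach a better optimum because it starts from a better initial point.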

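Some issues:

1. The memory problem described in the earlier exercises applies here as well, so I used fewer training samples and fewer iterations.

2. In the fine-tuning stage, back-propagation is used to adjust the weights of the two hidden layers, while the softmax weights are updated with their own softmax regression gradient (softmaxThetaGrad in stackedAECost.m below). The adjustment follows the back-propagation steps of the tutorial's fine-tuning procedure.

In step 2 of that procedure, the softmax layer as a whole is treated as the output of the second hidden layer. At first I assumed the softmax layer had to be back-propagated like the other layers, so I could not make sense of the formula in step 2: the softmax layer's output is computed by the softmax function while the hidden layers use the sigmoid function, so $f^{'}(z^{n_{l}})$ and $f^{'}(z^{l})$ would differ, the former being a derivative of the softmax and the latter of the sigmoid. I finally realized that the back-propagation formulas apply only to the hidden layers. The softmax layer needs no $\delta$ of its own, because it is trained with supervision and its gradient comes directly from the softmax regression cost. When computing $\delta ^{n_{l}}$ for the second hidden layer, we treat the whole softmax layer as its output, which makes the formula easy to understand; compare it with $\delta _{i}^{n_{l}}$ in ordinary back propagation.

Because the output of the second hidden layer is actually the input of the softmax layer, the softmax weights must be taken into account.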
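The formulas referred to in the discussion above can be written out as follows. This is a reconstruction derived from the cost used in stackedAECost.m later in this post, not a copy of the tutorial's figure. The symbols are assumptions introduced here: $m$ is the number of training examples, $K$ the number of classes, $Y$ the $K \times m$ indicator matrix of true labels (groundTruth), $P$ the $K \times m$ matrix of softmax probabilities, $\theta$ the softmax weights (softmaxTheta), $n_l$ the index of the last hidden layer, and $f$ the sigmoid, so that $f'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})$.

$$
\begin{aligned}
J &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} Y_{ki}\,\log P_{ki} + \frac{\lambda}{2}\sum_{k,j}\theta_{kj}^{2},\\
\nabla_{\theta} J &= -\frac{1}{m}\,(Y - P)\,\bigl(a^{(n_l)}\bigr)^{T} + \lambda\,\theta,\\
\delta^{(n_l)} &= -\bigl(\theta^{T}(Y - P)\bigr)\odot f'\bigl(z^{(n_l)}\bigr),\\
\delta^{(l)} &= \Bigl(\bigl(W^{(l)}\bigr)^{T}\delta^{(l+1)}\Bigr)\odot f'\bigl(z^{(l)}\bigr), \qquad l = n_l - 1, \dots, 2,\\
\nabla_{W^{(l)}} J &= \frac{1}{m}\,\delta^{(l+1)}\bigl(a^{(l)}\bigr)^{T}, \qquad
\nabla_{b^{(l)}} J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(l+1)}_{:,i}.
\end{aligned}
$$

The third line is where the softmax weights enter the hidden-layer error: the term $\theta^{T}(Y - P)$ takes the place of the usual output-layer error, and only the sigmoid derivative of the last hidden layer appears.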

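3. On visualizing the learned weights: the weights of hidden layer L1 can be displayed directly with display_network from Exercise: Sparse Autoencoder. That function shows each column of the matrix it is given as a small n*n image, so the dimension of each column must be a perfect square. The layer L2 weights W12 cannot be passed to display_network directly as for L1. L2 sits behind L1, so the L1 weights have to be taken into account, but the L1 activation function is nonlinear, so W12*W11 is not really the learned feature either. It can still be used to compare the features learned by L1 and L2: the contours learned by L2 are clearer and there are somewhat fewer noisy pixels.

4. In principle the visualized weights should show clearer contours after fine-tuning, but the result is actually worse, and I do not know why.

5. Be sure to copy the code you need from the chapters Exercise: Sparse Autoencoder, Exercise: Vectorization, Exercise: Softmax regression and Exercise: Self-taught Learning. I left out one file during the experiment and spent a long time debugging without finding the cause, which was simply carelessness. I am posting the list of required files.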
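The shapes involved in the visualization discussed in item 3 can be checked with a small sketch. Random matrices stand in for the learned weights (an assumption), and the final reshape only illustrates what a single displayed tile corresponds to; it is not the actual code of display_network.

```matlab
% Sketch: what is handed to display_network for each layer.
% Random matrices stand in for the learned weights (assumption).
W11 = randn(200, 784);     % layer 1: 200 hidden units x 784 input pixels
W12 = randn(200, 200);     % layer 2: 200 hidden units x 200 layer-1 units

A1 = W11';                 % 784 x 200, one 28x28 filter per column
A2 = (W12 * W11)';         % 784 x 200, linear approximation of layer-2 features

patchSide = sqrt(size(A1, 1));                       % must be an integer
filter1 = reshape(A2(:, 1), patchSide, patchSide);   % one tile as an image
disp(size(filter1));
```

W12 alone has columns of length 200, which is not a perfect square, and its entries refer to layer-1 units rather than pixels. Multiplying by W11 maps it back to pixel space while ignoring the sigmoid in between.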

Results:

Weights before fine-tuning:

Weights after fine-tuning:

Accuracy:

Before Finetuning Test Accuracy: 80.750%
After Finetuning Test Accuracy: 87.400%

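Because of the limited memory of my machine, only 2000 training samples were used and the hidden layers were trained for only 50 iterations, so the accuracy is fairly low. The result is still convincing evidence of the importance of fine-tuning.

Main code

stackedAEExercise.m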
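The listing below loads the full training set, and the post does not show how the 2000 samples were selected. One possible way, given purely as an assumption, is to keep the first numTrain columns after loading. Placeholder arrays stand in for the output of loadMNISTImages and loadMNISTLabels so that the sketch runs on its own, and numTrain is a hypothetical variable name.

```matlab
% Sketch: keep only the first numTrain examples to reduce memory use.
% Placeholder arrays stand in for the output of loadMNISTImages / loadMNISTLabels.
numTrain = 2000;
trainData = rand(28 * 28, 6000);        % placeholder, the real data is 784 x 60000
trainLabels = mod((1:6000)', 10);       % placeholder labels in 0..9
trainLabels(trainLabels == 0) = 10;     % same remapping as in the exercise

trainData = trainData(:, 1:numTrain);
trainLabels = trainLabels(1:numTrain);
disp(size(trainData));
```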

%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%  
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

DISPLAY = true;
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes). 
lambda = 3e-3;         % weight decay parameter       
beta = 3;              % weight of sparsity penalty term       

%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.

% Load MNIST database files
trainData = loadMNISTImages('train-images-idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels-idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the MNIST training
%  images (the labels are not used in this step).
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.
%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 50;  % reduced because of limited memory on the author's machine; with enough memory, raise it to 400 or more
options.display = 'on';
[sae1OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    inputSize,hiddenSizeL1,lambda,sparsityParam,beta,trainData),sae1Theta,options); % train the parameters of the first layer
save('saves/step2.mat', 'sae1OptTheta');
% show the learned weights in hidden layer L1
if DISPLAY
  W1 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
  figure;
  display_network(W1');
  title('first layer');   % set after plotting, since plotting resets the axes title
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 3: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta

[sae2OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    hiddenSizeL1,hiddenSizeL2,lambda,sparsityParam,beta,sae1Features),sae2Theta,options); % train the parameters of the second layer
save('saves/step3.mat', 'sae2OptTheta');

% -------------------------------------------------------------------------

%%======================================================================
%% STEP 4: Train the softmax classifier
%  This trains the softmax classifier on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.

[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);


%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta 
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);


softmaxLambda = 1e-4;
numClasses = 10;
softoptions = struct;
softoptions.maxIter = 400;
softmaxModel = softmaxTrain(hiddenSizeL2,numClasses,softmaxLambda,...
                            sae2Features,trainLabels,softoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);

save('saves/step4.mat', 'saeSoftmaxOptTheta');

% A nonlinear transformation is applied before hidden layer L2, so showing
% the weights learned in L2 is more complicated than for L1, and the
% author has not worked out how to do it properly. W12*W11 can still be
% displayed as a rough stand-in for the L2 weights when comparing the
% features learned in L1 and L2.
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 5: Finetune softmax model

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.

% Initialize the stack using the parameters learned
stack = cell(2,1);
% Note: sae1OptTheta and sae2OptTheta also contain the weights of each sparse autoencoder's reconstruction layer, which are not used here
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ]; % parameter vector of the whole network, with the classifier parameters placed first

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the '
%                dimension of the input to the classifier, which corresponds 
%                to "hiddenSizeL2".
%
%

[stackedAEOptTheta, cost] =  minFunc(@(p)stackedAECost(p,inputSize,hiddenSizeL2,...
                         numClasses, netconfig,lambda, trainData, trainLabels),...
                        stackedAETheta,options); % fine-tune the parameters of the whole network
save('saves/step5.mat', 'stackedAEOptTheta');


if DISPLAY
  optStack = params2stack(stackedAEOptTheta(hiddenSizeL2*numClasses+1:end), netconfig);
  W11 = optStack{1}.w;
  W12 = optStack{2}.w;
  figure;
  display_network( W11' );
  title('fine tuning: W11' );   % set after plotting, since plotting resets the axes title
  figure;
  display_network( (W12*W11)' );
  title('fine tuning: W12');
  
  % TODO(zellyn): figure out how to display a 2-level network
  % display_network(log(1 ./ (1-W11')) * W12');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 6: Test 
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels-idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10

[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)

stackedAECost.m

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)
                                         
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
                                         
% theta: trained weights from the autoencoder
% inputSize:   the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example


%% Unroll softmaxTheta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));


%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for 
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

% forward propagation
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end

M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));

numCases = size(data, 2);
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numCases * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;

% back propagation
d = cell(depth+1, 1);
% the softmax layer as a whole is treated as the output of hidden layer L2
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
  d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end

% average over the training examples (numCases, not numClasses) so that the
% gradient matches the 1/numCases factor in the cost above
for layer = (depth:-1:1)
  stackgrad{layer}.w = (1/numCases) * d{layer+1} * a{layer}';
  stackgrad{layer}.b = (1/numCases) * sum(d{layer+1}, 2);
end

% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
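A cost and gradient pair like the one above is usually verified with a numerical gradient check. The sketch below shows the idea on a toy cost with a known gradient. The toy functions J and gradJ, the test point and epsilon are assumptions for illustration and are not part of the exercise code.

```matlab
% Sketch of a numerical gradient check on a toy cost (not the stacked AE cost).
J     = @(t) 0.5 * sum(t .^ 2) + sum(sin(t));   % toy cost
gradJ = @(t) t + cos(t);                        % its analytic gradient

theta = [0.3; -1.2; 2.0; 0.7];
epsilon = 1e-4;
numGrad = zeros(size(theta));
for i = 1:numel(theta)
  e = zeros(size(theta));
  e(i) = epsilon;
  numGrad(i) = (J(theta + e) - J(theta - e)) / (2 * epsilon);   % central difference
end

% Relative difference between the numerical and the analytic gradient
relDiff = norm(numGrad - gradJ(theta)) / norm(numGrad + gradJ(theta));
disp(relDiff);   % should be very small when the analytic gradient is right
```

To check stackedAECost in the same way, J would be replaced by a wrapper such as @(t) stackedAECost(t, ...) evaluated on a very small network and a handful of examples, and the second output of stackedAECost would take the place of gradJ. A wrong scaling factor in any part of the gradient then shows up as a large relative difference.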

stackedAEPredict.m

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
                                         
% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.
                                         
% theta: trained weights from the autoencoder
% inputSize:   the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the test data as columns.  So, data(:,i) is the i-th test example. 

% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
 
%% Unroll theta parameter

% We first extract the softmax parameters
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.
% input is the first layer, hidden layer L1 and L2 are the second and third
% layer respectively
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;

for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end

[~, pred] = max(softmaxTheta * a{depth+1});%use softmaxTheta to predict
% -----------------------------------------------------------

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
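The prediction step takes the arg max over classes for each column, and the accuracy is the fraction of matching labels. A minimal sketch with made-up scores and labels (both are toy assumptions):

```matlab
% Sketch: turning class scores into predictions and an accuracy figure.
% The score matrix and labels are made-up toy values.
scores = [0.1 0.7 0.2;     % rows = classes, columns = examples
          0.8 0.2 0.3;
          0.1 0.1 0.5];
labels = [2; 1; 1];        % true classes, 1-based

[~, pred] = max(scores, [], 1);          % index of the largest score per column
acc = mean(labels(:) == pred(:));        % fraction of correct predictions
fprintf('Accuracy: %0.3f%%\n', acc * 100);
```

Because the softmax function is monotonic in each score, taking the arg max of softmaxTheta * a{depth+1} directly gives the same prediction as taking the arg max of the normalized probabilities, so stackedAEPredict does not need to exponentiate.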

posted @ 2014-11-27 15:15 dupuleng
