7. Implement deep networks for digit classification

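Summary:

1) Two sparse autoencoders are stacked one after the other, followed by a softmax classifier. Compared with the previous exercise there are two differences: (1) one more sparse autoencoder layer is added; (2) the whole network is fine-tuned with backpropagation.

2) I also gained a somewhat better understanding of gradients, bias terms, and residuals (the delta error terms used in backpropagation).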

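To make item 1 concrete, here is a minimal MATLAB/Octave sketch of the forward pass through the stacked network (784 inputs, two hidden layers of 200 units, 10 classes). It only illustrates how the matrix shapes line up. The weights are deterministic stand-ins rather than trained parameters, and M = 3 examples is an arbitrary choice; the variable names mirror the ones used in stackedAECost.m.

```matlab
% Shape walk-through of the 784-200-200-10 network (untrained stand-in weights).
sigmoid = @(x) 1 ./ (1 + exp(-x));

inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200; numClasses = 10;
M = 3;                                         % number of examples, one per column

% Deterministic fake data and weights (assumption: values are illustrative only)
data = 0.5 + 0.5 * sin(reshape(1:inputSize*M, inputSize, M));
W1 = 0.01 * cos(reshape(1:hiddenSizeL1*inputSize, hiddenSizeL1, inputSize));
b1 = zeros(hiddenSizeL1, 1);
W2 = 0.01 * sin(reshape(1:hiddenSizeL2*hiddenSizeL1, hiddenSizeL2, hiddenSizeL1));
b2 = zeros(hiddenSizeL2, 1);
softmaxTheta = 0.01 * cos(reshape(1:numClasses*hiddenSizeL2, numClasses, hiddenSizeL2));

a2 = sigmoid(bsxfun(@plus, W1 * data, b1));    % [200 x 784] * [784 x M] -> [200 x M]
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));      % [200 x 200] * [200 x M] -> [200 x M]
scores = softmaxTheta * a3;                    % [10 x 200]  * [200 x M] -> [10 x M]

% Column-wise softmax; subtracting the column maximum avoids overflow in exp
E = exp(bsxfun(@minus, scores, max(scores, [], 1)));
p = bsxfun(@rdivide, E, sum(E, 1));            % each column sums to 1

disp(size(a2)); disp(size(a3)); disp(size(p));
```

Rows always index units (or classes) and columns always index examples, which is why every weight matrix multiplies from the left.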

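Questions:

1) The softmax backpropagation part was mostly a matter of plugging into formulas. I could not derive them by hand (and have no time right now), so the code I wrote feels awkward.

2) My earlier code was mostly written after reading other people's code, so I still do not clearly understand how rows and columns line up in the matrix operations.

3) I do not understand minFunc or L-BFGS, so writing this part was uncomfortable. minFunc in particular feels like a complete black box.

4) Why does this exercise add no weight decay term for the sparse autoencoder weights during fine-tuning? Is it because weight decay and the sparsity penalty were already applied when the autoencoder layers were pretrained?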

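For question 1, the following is a sketch of the chain-rule steps behind the fine-tuning gradients, written to match the expressions used in stackedAECost.m. The notation is an assumption introduced here: S = θ a^(3) are the class scores (θ is softmaxTheta), P is the column-wise softmax of S (p in the code), G is the one-hot matrix groundTruth, M is the number of examples, K the number of classes, f is the sigmoid, and ⊙ is the element-wise product. As in the code, the δ terms are left unscaled and the factor 1/M is applied when forming the gradients.

```latex
\begin{aligned}
J &= -\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{K} G_{ci}\log P_{ci}
     + \frac{\lambda}{2}\lVert\theta\rVert_F^2,
  \qquad P_{ci} = \frac{e^{S_{ci}}}{\sum_{k=1}^{K} e^{S_{ki}}} \\
\frac{\partial J}{\partial S} &= -\frac{1}{M}\,(G - P)
  \qquad \text{(uses } \textstyle\sum_{c} G_{ci} = 1\text{)} \\
\nabla_{\theta} J &= -\frac{1}{M}\,(G - P)\,{a^{(3)}}^{\top} + \lambda\,\theta \\
\delta^{(3)} &= -\bigl(\theta^{\top}(G - P)\bigr)\odot f'(z^{(3)}),
  \qquad f'(z) = f(z)\bigl(1 - f(z)\bigr) \\
\delta^{(2)} &= \bigl({W^{(2)}}^{\top}\delta^{(3)}\bigr)\odot f'(z^{(2)}) \\
\nabla_{W^{(l)}} J &= \frac{1}{M}\,\delta^{(l+1)}\,{a^{(l)}}^{\top},
  \qquad \nabla_{b^{(l)}} J = \frac{1}{M}\,\delta^{(l+1)}\,\mathbf{1},
  \qquad a^{(1)} = x,\quad l = 1, 2
\end{aligned}
```

The factor f'(z^(3)) comes from the sigmoid of the last hidden layer, not from softmax; the softmax derivative is already contained in (G − P). Regarding question 4, if one chose to add weight decay on the hidden layers (a hypothetical variant, not what the exercise does), J would gain (λ/2) Σ_l ‖W^(l)‖² and each ∇_{W^(l)} J would gain λ W^(l).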


　　UFLDL Implement deep networks for digit classification

　　The exercise requires downloading the starter code: stackedae_exercise.zip

　　stackedAEExercise.m

clear;close all;clc;
disp('Currently running script:');
disp([mfilename('fullpath'),'.m']);
%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%  
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.

inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek letter rho, which looks like a lower-case "p",
		               %  in the lecture notes). 
lambda = 3e-3;         % weight decay parameter       
beta = 3;              % weight of sparsity penalty term       

%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.

% Load MNIST database files
trainData = loadMNISTImages('mnist/train-images-idx3-ubyte');
trainLabels = loadMNISTLabels('mnist/train-labels-idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the MNIST training
%  images (the labels are not used at this stage).
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.


%  Randomly initialize the parameters
% sae1Theta has size [314584,1]; 314584 = 2*784*200 + 784 + 200
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                a hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta

tic;
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
% options.maxIter = 5;                          
options.maxIter = 400;	  % Maximum number of iterations of L-BFGS to run 

options.display = 'on';


[sae1OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   inputSize, hiddenSizeL1, ...
                                   lambda, sparsityParam, ...
                                   beta, trainData), ...
                              sae1Theta, options);
disp(['Layer 1: autoencoder training with 400 iterations of L-BFGS took (seconds): ',num2str(toc)]);
                          
% -------------------------------------------------------------------------
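% Display the features learned by the first-layer autoencoder
% W1 = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), hiddenSizeL1, inputSize);
% figure;
% % I read the 12 passed to display_network as the number of image rows.
% % W1 has size [200,784]; the call below shows 12 rows of 28x28 images.
% % display_network(W1', 12); 
% % Dropping the 12 still splits the 200 filters into 12 rows of 28x28 images,
% % so the 12 may not be a row count (in display_network.m the second argument appears to be opt_normalize).
% display_network(W1'); 
% set(gcf,'NumberTitle','off');
% set(gcf,'Name','First-layer sparse autoencoder weights');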
% 
% print -djpeg sparseAutoencoderweights1.jpg   % save the visualization to a file 

%%======================================================================
%% STEP 2: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                a hidden size of "hiddenSizeL2" and an input size of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta

tic;
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
% options.maxIter = 5;                          
options.maxIter = 400;	  % Maximum number of iterations of L-BFGS to run 

options.display = 'on';


[sae2OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   hiddenSizeL1, hiddenSizeL2, ...
                                   lambda, sparsityParam, ...
                                   beta, sae1Features), ...
                              sae2Theta, options);
disp(['Layer 2: autoencoder training with 400 iterations of L-BFGS took (seconds): ',num2str(toc)]);
                          
% -------------------------------------------------------------------------
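% Display the features learned by the second-layer autoencoder
% I wanted to visualize the second-layer weights as well, but since I do not understand display_network.m, it kept failing with:
%Error using reshape
% Size arguments must be real integers.
% 
% Error in display_network (line 67)
%                 array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
% A likely cause: each column of W1' here has 200 entries, and 200 is not a perfect square,
% so it cannot be reshaped into an sz-by-sz patch. So the plot is skipped.
% W1 = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), hiddenSizeL2, hiddenSizeL1);
% figure;
% % As I understand it, the argument 12 after display_network is the number of image rows
% display_network(W1'); 
% set(gcf,'NumberTitle','off');
% set(gcf,'Name','Second-layer sparse autoencoder weights');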
% 
% print -djpeg sparseAutoencoderweights2.jpg   % save the visualization to a file 

%%======================================================================
%% STEP 3: Train the softmax classifier
%  This trains the softmax classifier on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.

[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);


%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta 
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);

options.maxIter = 100;
softmaxModel = softmaxTrain(hiddenSizeL2, numClasses, lambda, ...
                            sae2Features, trainLabels, options);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);

% -------------------------------------------------------------------------



%%======================================================================
%% STEP 5: Finetune softmax model

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.

% Initialize the stack using the parameters learned
stack = cell(2,1);
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
% All model parameters are packed into stackedAETheta
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the
%                dimension of the input to the classifier, which corresponds 
%                to "hiddenSizeL2".
%
%


tic;
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % stackedAECost.m satisfies this.
% options.maxIter = 5;                          
options.maxIter = 400;	  % Maximum number of iterations of L-BFGS to run 

options.display = 'on';


[stackedAEOptTheta, cost] = minFunc( @(p) stackedAECost(p, inputSize, hiddenSizeL2, ...
                                              numClasses, netconfig, ...
                                              lambda, trainData, trainLabels), ... 
                              stackedAETheta, options);
disp(['Stage 3: fine-tuning with 400 iterations of L-BFGS took (seconds): ',num2str(toc)]);

% -------------------------------------------------------------------------



%%======================================================================
%% STEP 6: Test 
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('mnist/t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('mnist/t10k-labels-idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10

[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)
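The script above calls minFunc three times, each time passing a function handle that returns both the cost and its gradient, plus an initial parameter vector. The sketch below illustrates only that contract on a toy quadratic. It deliberately does not call minFunc, so it runs without the minFunc folder, and it uses plain gradient descent as a stand-in optimizer, which is far simpler than L-BFGS. The names toyCost, target, and stepSize are hypothetical and exist only for this illustration.

```matlab
% Toy objective f(x) = 0.5*||x - target||^2, whose gradient is (x - target).
target  = [1; -2; 3];
toyCost = @(x) deal(0.5 * sum((x - target).^2), x - target);   % returns [cost, grad]

x = zeros(3, 1);          % initial parameters, playing the role of sae1Theta
stepSize = 0.5;           % fixed step size; L-BFGS would pick its own steps
for iter = 1:50
    [cost, grad] = toyCost(x);      % the same two-output call an optimizer makes
    x = x - stepSize * grad;        % move against the gradient
end

[finalCost, finalGrad] = toyCost(x);
fprintf('final cost %g, gradient norm %g\n', finalCost, norm(finalGrad));
```

With minFunc on the path, the loop would be replaced by a call of the same shape as in the script, `minFunc(toyCost, x, options)`. Seen this way, the optimizer needs nothing from the model except the pair (cost, gradient) at any parameter vector it asks about.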

　　stackedAECost.m　　

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)
                                         
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
                                         
% theta: trained weights from the autoencoder
% inputSize:   the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example


%% Unroll softmaxTheta parameter

% We first extract the part of theta that holds the softmax parameters
% and reshape that slice of the vectorized theta back into matrix form
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
% Initialize the gradient matrix softmaxThetaGrad and the gradient stack stackgrad
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));


%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for 
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

% Copy the layer parameters out of the stack first
W1=stack{1}.w;
b1=stack{1}.b;
W2=stack{2}.w;
b2=stack{2}.b;

% Forward propagation
z2=bsxfun(@plus,W1*data,b1);
a2=sigmoid(z2);
z3=bsxfun(@plus,W2*a2,b2);
a3=sigmoid(z3);

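% As in the earlier softmax code, subtract each column's maximum before exponentiating.
% This prevents overflow in exp; because softmax is over-parameterized, the shift leaves p unchanged.
% The earlier softmax code named this intermediate M, but here M is already the number of examples, so E is used.
scores=softmaxTheta*a3;
E=exp(bsxfun(@minus,scores,max(scores,[],1)));
p=bsxfun(@rdivide,E,sum(E,1));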

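% Cost. The softmax cost and the hidden-layer weight-decay cost are kept separate so that
% the effect of adding weight decay on overall performance can be tested.
% The two nested sums first add over the classes within each example, then over the examples.
J_softmax=-1/M*sum(sum(groundTruth.*log(p)))+lambda/2*sum(sum(softmaxTheta.*softmaxTheta));

% The UFLDL exercise says not to regularize the hidden-layer weights; I do not know why.
%J_weight=lambda/2*(sum(sum(W1.^2))+sum(sum(W2.^2)));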

cost=J_softmax;

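% Backpropagation
% Start with the softmax layer.
% Gradient of the softmax parameters; it is the same as the gradient computed in softmaxCost.m.
% This plays the role of the gradient for W3.
softmaxThetaGrad=-1/M*(groundTruth-p)*a3'+lambda*softmaxTheta;

% Error term at the last hidden layer. The trailing factor sigmoidDer(z3) is the derivative of
% layer 3's sigmoid activation, not of softmax; the softmax derivative is already in (groundTruth-p).
% (I originally followed someone else's code here without being sure of this.)
% This is effectively delta3, used for the gradient of W2.
deltaSoftmax=-softmaxTheta'*(groundTruth-p).*sigmoidDer(z3);
% delta2, used for the gradient of W1.
delta2=W2'*deltaSoftmax.*sigmoidDer(z2);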

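% Because of the naming used here, the gradients of the two layers cannot be computed in a loop and are written out separately.
% A possible follow-up experiment: add weight decay to the autoencoder-layer weights and see what accuracy results.
% Without weight decay on the two layers below, the results were:
% Before Finetuning Test Accuracy: 87.990%
% After Finetuning Test Accuracy: 97.550%
% Weight decay could be added to the gradient terms in the same way as in sparseAutoencoderCost.m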
stackgrad{2}.w = 1/M*deltaSoftmax*a2';
stackgrad{2}.b = 1/M*sum(deltaSoftmax,2);

stackgrad{1}.w = 1/M*delta2*data';
stackgrad{1}.b = 1/M*sum(delta2,2);
% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

function sigmDer = sigmoidDer(x)
    % Derivative of the sigmoid with respect to its input x
    s = sigmoid(x);
    sigmDer = s.*(1-s);
end
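Since the delta formulas in stackedAECost.m were taken largely on trust, a numerical gradient check is one way to gain confidence in them. The sketch below is self-contained and does not call stackedAECost. It rebuilds the same cost and the same delta3/delta2 expressions on a tiny 4-3-3-2 network with deterministic stand-in weights, then compares the analytic gradient for W1 against central differences. All sizes, values, and the helper names softmaxCols, hidden3, and costOfW1 are assumptions made for this illustration.

```matlab
% Numerical check of the delta2 -> W1 gradient on a tiny 4-3-3-2 network.
sigmoid = @(x) 1 ./ (1 + exp(-x));

nIn = 4; nH1 = 3; nH2 = 3; nC = 2; M = 5;
data   = sin(reshape(1:nIn*M, nIn, M));
labels = [1 2 1 2 2];
groundTruth = full(sparse(labels, 1:M, 1));          % [nC x M] one-hot matrix
W1 = 0.5 * cos(reshape(1:nH1*nIn, nH1, nIn));   b1 = 0.1 * (1:nH1)';
W2 = 0.5 * sin(reshape(1:nH2*nH1, nH2, nH1));   b2 = -0.1 * (1:nH2)';
softmaxTheta = 0.5 * cos(reshape(1:nC*nH2, nC, nH2));
lambda = 1e-2;

% Cost as a function of W1 only (every other parameter is held fixed)
softmaxCols = @(S) bsxfun(@rdivide, exp(bsxfun(@minus, S, max(S, [], 1))), ...
                          sum(exp(bsxfun(@minus, S, max(S, [], 1))), 1));
hidden3  = @(W) sigmoid(bsxfun(@plus, W2 * sigmoid(bsxfun(@plus, W * data, b1)), b2));
costOfW1 = @(W) -1/M * sum(sum(groundTruth .* log(softmaxCols(softmaxTheta * hidden3(W))))) ...
                + lambda/2 * sum(softmaxTheta(:).^2);

% Analytic gradient, using the same formulas as stackedAECost.m
a2 = sigmoid(bsxfun(@plus, W1 * data, b1));
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));
p  = softmaxCols(softmaxTheta * a3);
delta3 = -(softmaxTheta' * (groundTruth - p)) .* a3 .* (1 - a3);
delta2 = (W2' * delta3) .* a2 .* (1 - a2);
gradW1 = 1/M * delta2 * data';

% Central-difference gradient, one entry of W1 at a time
epsilon = 1e-4;
numGrad = zeros(size(W1));
for k = 1:numel(W1)
    Wplus  = W1; Wplus(k)  = Wplus(k)  + epsilon;
    Wminus = W1; Wminus(k) = Wminus(k) - epsilon;
    numGrad(k) = (costOfW1(Wplus) - costOfW1(Wminus)) / (2 * epsilon);
end

relErr = norm(numGrad(:) - gradW1(:)) / norm(numGrad(:) + gradW1(:));
fprintf('relative error between numerical and analytic gradient: %g\n', relErr);
```

A very small relative error suggests that the sigmoid-derivative factor in delta3 and the chain through W2 are correct. The same loop can be pointed at W2 or softmaxTheta by changing which parameter is perturbed.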

　　stackedAEPredict.m

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
                                         
% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.
                                         
% theta: trained weights from the autoencoder
% inputSize:   the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 

% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
 
%% Unroll theta parameter

% We first extract the part of theta that holds the softmax parameters
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.
% Prediction itself is simple, but you need a picture of the whole model to know how rows and columns line up.
W1=stack{1}.w;
b1=stack{1}.b;
W2=stack{2}.w;
b2=stack{2}.b;

a2=sigmoid(bsxfun(@plus,W1*data,b1));
a3=sigmoid(bsxfun(@plus,W2*a2,b2));
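% softmax is monotonic within each column, so the arg max of the raw scores
% equals the arg max of the class probabilities
[~,pred]=max(softmaxTheta*a3,[],1);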

% -----------------------------------------------------------

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
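The prediction step relies on two row/column conventions that are easy to get wrong: max is taken down each column (one column per example), and pred comes back as a row vector while the labels are a column vector. The small sketch below uses a made-up 3-class score matrix to show both. The numbers and the names predFromProb and acc are illustrative assumptions, not taken from the exercise.

```matlab
% Column-wise arg max over class scores (rows = classes, columns = examples).
scores = [0.1 2.0 0.3 0.0;      % class 1
          0.7 0.5 0.2 0.1;      % class 2
          0.2 0.1 0.9 3.0];     % class 3
[~, pred] = max(scores, [], 1);             % pred is 1 x 4: [2 1 3 3]

% Applying softmax first does not change the arg max
E = exp(bsxfun(@minus, scores, max(scores, [], 1)));
p = bsxfun(@rdivide, E, sum(E, 1));
[~, predFromProb] = max(p, [], 1);

% Labels are stored as a column vector; (:) makes both sides columns before
% comparing, so the comparison is element by element.
testLabels = [2; 1; 3; 1];
acc = mean(testLabels(:) == pred(:));       % 3 of 4 correct -> 0.75
fprintf('accuracy: %0.2f\n', acc);
```

Without the `(:)` on both sides, comparing a column vector with a row vector can either raise an error or expand to a full matrix, depending on the MATLAB or Octave version, so the explicit reshaping is the safer habit.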

posted @ 2015-11-18 11:17 by 菜鸡一枚
