Exercise : Stacked Autoencoder
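Overview: handwritten digit recognition on the MNIST database. The digits are 0-9; there are 60,000 training samples and 10,000 test samples, and each sample is a 28*28 image.
Environment: MATLAB 2010a
Method:
We use a multi-layer neural network, here with two hidden layers. The hidden layers are learned separately, one after another, with the output of one hidden layer serving as the input of the next. Training each hidden layer amounts to training a three-layer autoencoder, which is unsupervised learning.
The last layer is a softmax regression model, trained as a 10-class classifier. Back-propagation is then used to adjust the parameters (this chapter calls it the fine-tuning strategy).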
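Theory:
Advantages of deep neural networks:
1. They can reduce the number of neurons needed exponentially. For example, to compute XOR (the parity of n inputs), a network with a single hidden layer or an otherwise shallow network needs exponentially many neurons, whereas a deep network needs only polynomially many.
2. Feature learning follows part-whole decompositions. The first layer may learn edge features, the second layer joins edges to detect longer contours, and higher layers may detect parts of an object, such as a person's eyes or nose.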
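A small illustration of the first point, offered as a sketch rather than a proof: the parity of n bits can be computed by chaining n-1 two-input XOR gates, while a flat two-level OR-of-ANDs formula for parity needs 2^(n-1) product terms. The bit pattern and variable names below are made up for the example.

```matlab
% Parity of n bits: a deep chain of 2-input XORs vs. a flat sum-of-products.
bits = [1 0 1 1 0 0 1 0];        % made-up example input
n = numel(bits);

% Deep circuit: fold the bits through n-1 two-input XOR gates.
parityDeep = bits(1);
for i = 2:n
    parityDeep = double(xor(parityDeep, bits(i)));
end

% Reference answer: parity is 1 exactly when the number of ones is odd.
parityRef = mod(sum(bits), 2);

deepGates    = n - 1;            % grows linearly with n
shallowTerms = 2^(n - 1);        % product terms in the flat OR-of-ANDs form

fprintf('parity = %d, deep gates = %d, flat terms = %d\n', ...
        parityDeep, deepGates, shallowTerms);
```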
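Difficulties in training deep neural networks:
- availability of data: Training a neural network is supervised learning and needs a large number of labeled samples. As depth increases, the number of training samples needed grows very quickly, and without enough of them the model overfits. Labeled samples, however, are hard to obtain in large quantities.
- local optima: Supervised training of a shallow network usually yields good parameters. But the objective is a highly non-convex problem, and when training a deep network the non-convexity becomes severe, so gradient descent and similar methods tend to end up in poor local optima.
- diffusion of gradients: In a deep network the gradient is usually computed with back propagation, which passes the output-layer error back through the hidden layers one at a time. As depth increases the gradient shrinks quickly, so by the time it reaches the earliest hidden layers it is already very small. The resulting weight changes are also very small, and those early layers contribute little to improving the performance of the whole model.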
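The shrinking gradient can be seen in a toy calculation. The sketch below assumes a chain of single-unit sigmoid layers with every weight set to 1 (an illustrative choice, not the network used in this exercise). Since the sigmoid derivative a(1-a) is at most 0.25, each layer the error term passes through multiplies it by at most 0.25.

```matlab
% Back-propagate an error term through a chain of single-unit sigmoid layers.
sigmoid = @(x) 1 ./ (1 + exp(-x));

depth = 8;                 % number of sigmoid layers (illustrative)
w = ones(depth, 1);        % one weight per layer, all set to 1 (assumption)
a = zeros(depth + 1, 1);   % a(1) is the input, a(l+1) the output of layer l
a(1) = 0.5;

% Forward pass.
for l = 1:depth
    a(l + 1) = sigmoid(w(l) * a(l));
end

% Backward pass: delta(l) is the error term at the pre-activation of layer l.
delta = zeros(depth, 1);
delta(depth) = 1 * a(depth + 1) * (1 - a(depth + 1));   % unit error at the top
for l = depth-1:-1:1
    delta(l) = w(l + 1) * delta(l + 1) * a(l + 1) * (1 - a(l + 1));
end

disp(abs(delta)');         % magnitudes shrink toward the first layer
```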
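greedy layer-wise learning:
This model also has several hidden layers, but each hidden layer is learned separately: first learn the first layer, then use its output as the input for learning the second layer, and so on. Learning each layer is a simple three-layer autoencoder problem, and softmax regression is added after the last layer as the classifier.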
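To make the stacking concrete, the sketch below only traces array shapes through the network used in this exercise (784 -> 200 -> 200 -> 10). The weights are zero placeholders standing in for what each greedy training stage would produce, and the number of examples m is made up.

```matlab
% Trace activation shapes through the stack: 784 -> 200 -> 200 -> 10.
sigmoid = @(x) 1 ./ (1 + exp(-x));

inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200; numClasses = 10;
m = 5;                                      % made-up number of examples

x  = zeros(inputSize, m);                   % one example per column
W1 = zeros(hiddenSizeL1, inputSize);    b1 = zeros(hiddenSizeL1, 1);
W2 = zeros(hiddenSizeL2, hiddenSizeL1); b2 = zeros(hiddenSizeL2, 1);
softmaxTheta = zeros(numClasses, hiddenSizeL2);

a1 = sigmoid(W1 * x  + repmat(b1, 1, m));   % features from autoencoder 1
a2 = sigmoid(W2 * a1 + repmat(b2, 1, m));   % features from autoencoder 2
scores = softmaxTheta * a2;                 % input to the softmax classifier

disp(size(a1)); disp(size(a2)); disp(size(scores));
```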
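Advantages of this model:
- availability of data: An autoencoder is entirely unsupervised, so it needs only unlabeled samples, which are far easier to obtain than labeled ones. This is also the biggest advantage of self-taught learning.
- better local optima: The unlabeled samples are used to learn an initial value W for the weights that is better than a random initialization. Fine-tuning then adjusts the weights starting from W, and because it starts from a better point it generally reaches a better optimum.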
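Some issues:
1. The memory problem described in earlier exercises applies here as well, so I used fewer training samples and fewer iterations.
2. In the fine-tuning stage, the weights of the two hidden layers are adjusted by back propagation (the softmax weights are updated too, using the softmax regression gradient). The procedure for the hidden layers is as follows: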

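In step 2 the whole softmax layer is treated as the output of the second hidden layer. At first I assumed step 2 was also about adjusting the softmax weights, so I could not make sense of its formula: the softmax layer uses the softmax function at its output while the hidden layers use the sigmoid function, so the derivatives are different (one differentiates the softmax, the other the sigmoid). Only later did I realize that these back-propagation steps concern the hidden layers only. The softmax weights do not go through this step because the softmax layer is trained with supervision and its gradient comes directly from the softmax regression formula; in the code below, stackedAECost computes it as softmaxThetaGrad. When computing the error term $\delta^{(n_l)}$ of the second hidden layer, we treat the softmax layer as a whole as its output, which makes $\delta^{(n_l)} = -\left(\theta^{T}(I - P)\right) \bullet f'(z^{(n_l)})$ easy to understand. Here $\theta$ is the softmax weight matrix, $I$ the matrix of one-hot labels (groundTruth in the code) and $P$ the matrix of softmax probabilities (p in the code). Compare it with $\delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})$ in ordinary back propagation.

Since the output of the second hidden layer is actually the input of softmax, the softmax weights $\theta$ have to be taken into account.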
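One way to convince yourself that the error term of the last hidden layer is right is a numerical gradient check. The sketch below uses a tiny made-up problem (one hidden layer feeding a softmax layer, no weight decay, hypothetical variable names) and compares the analytic gradient built from that error term against central differences. It is not the exercise code itself.

```matlab
% Numerical check of the hidden-layer error term used in fine-tuning.
sigmoid = @(x) 1 ./ (1 + exp(-x));

% Tiny made-up problem: 3 inputs, 4 hidden units, 2 classes, 5 examples.
x      = reshape(sin(1:15), 3, 5);
labels = [1 2 2 1 2];
W      = 0.5 * reshape(cos(1:12), 4, 3);       % hidden-layer weights
b      = zeros(4, 1);                          % hidden-layer biases
theta  = 0.5 * reshape(sin(2 * (1:8)), 2, 4);  % softmax weights
m      = size(x, 2);
I      = full(sparse(labels, 1:m, 1));         % one-hot labels, 2 x 5

hidden = @(Wh) sigmoid(Wh * x + repmat(b, 1, m));
probs  = @(a) bsxfun(@rdivide, exp(theta * a), sum(exp(theta * a), 1));
cost   = @(Wh) -1/m * sum(sum(I .* log(probs(hidden(Wh)))));

% Analytic gradient: softmax layer treated as the output of the hidden layer.
a     = hidden(W);
P     = probs(a);
delta = -(theta' * (I - P)) .* a .* (1 - a);
gradW = (1/m) * delta * x';

% Central-difference gradient, one weight at a time.
numGrad = zeros(size(W));
e = 1e-4;
for k = 1:numel(W)
    Wp = W; Wp(k) = Wp(k) + e;
    Wm = W; Wm(k) = Wm(k) - e;
    numGrad(k) = (cost(Wp) - cost(Wm)) / (2 * e);
end

relErr = norm(numGrad(:) - gradW(:)) / norm(numGrad(:) + gradW(:));
fprintf('relative error: %g\n', relErr);
```

A very small relative error suggests the error term and the 1/m normalization are consistent with the cost.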
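3. On visualizing the learned weights: the weights of hidden layer L1 can be shown directly with display_network from Exercise: Sparse Autoencoder. That function shows each column of the matrix it is given as a small n*n image, so the dimension of each column must be a perfect square. The L2 weights W12 cannot be passed to display_network directly the way the L1 weights are. L2 sits behind L1, so the L1 weights have to be taken into account, but the L1 activation function is nonlinear, so W12*W11 cannot be taken as the learned features either. It can still be used to compare what L1 and L2 have learned: the contours learned by L2 are clearer and there are fewer noise points.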
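The shape requirement can be checked without plotting anything. The sketch below uses zero placeholders for the learned weights and plain reshape instead of display_network; it only confirms that both W11' and (W12*W11)' have 784-element columns that reshape to 28*28 tiles.

```matlab
% Shape check for visualizing first- and second-layer weights.
inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200;
W11 = zeros(hiddenSizeL1, inputSize);      % placeholder for learned L1 weights
W12 = zeros(hiddenSizeL2, hiddenSizeL1);   % placeholder for learned L2 weights

A1 = W11';            % 784 x 200: one column per L1 hidden unit
A2 = (W12 * W11)';    % 784 x 200: linear approximation for the L2 units

side = sqrt(size(A1, 1));                  % each column must reshape to side x side
isSquare = (side == floor(side));
tile = reshape(A2(:, 1), side, side);      % 28 x 28 image for the first L2 unit

fprintf('tile size: %d x %d\n', size(tile, 1), size(tile, 2));
```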
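4. In principle the visualized weights should show clearer contours after fine-tuning, but the result actually looks worse, and I do not know why.
5. Be sure to copy the code you need from the chapters Exercise: Sparse Autoencoder, Exercise: Vectorization, Exercise: Softmax Regression and Exercise: Self-taught Learning. I left out one file while doing the experiment and spent a long time debugging without finding the cause. Far too careless. Here are the files needed: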

Results:
Weights before fine-tuning:


Weights after fine-tuning:

Accuracy:
Before Finetuning Test Accuracy: 80.750%
After Finetuning Test Accuracy: 87.400%
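Because of my machine's limited memory I used only 2000 training samples and only 50 iterations when training the hidden layers, so the accuracy is fairly low. Even so, the results show convincingly how much fine-tuning matters.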
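The post does not show how the 2000 samples were selected. A minimal sketch, assuming the simplest choice of keeping the first 2000 columns right after loading; the arrays below are stand-ins for the output of loadMNISTImages and loadMNISTLabels, and numTrain is a hypothetical name.

```matlab
% Keep only the first numTrain examples to reduce memory use.
numTrain = 2000;

% Stand-ins for the loaded data; real MNIST training data would be
% 784 x 60000 images and a 60000 x 1 label vector.
trainData   = zeros(28 * 28, 6000);
trainLabels = mod((1:6000)', 10);
trainLabels(trainLabels == 0) = 10;   % remap 0 to 10, as in the exercise

trainData   = trainData(:, 1:numTrain);   % one example per column
trainLabels = trainLabels(1:numTrain);

fprintf('training on %d examples\n', size(trainData, 2));
```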
Main code of the experiment
stackedAEExercise.m
%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.

DISPLAY = true;
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes).
lambda = 3e-3;         % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.

% Load MNIST database files
trainData = loadMNISTImages('train-images-idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels-idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the unlabelled STL training
%  images.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 50;   % reduced because of limited memory; with enough
                        % memory, raise it to 400 or more
options.display = 'on';
% Train the parameters of the first autoencoder
[sae1OptTheta, cost] = minFunc(@(p)sparseAutoencoderCost(p, ...
    inputSize, hiddenSizeL1, lambda, sparsityParam, beta, trainData), sae1Theta, options);
save('saves/step2.mat', 'sae1OptTheta');

% show the learned weights in hidden layer L1
if DISPLAY
    W1 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
    figure;
    title('first layer');
    display_network(W1');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 2: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta
% Train the parameters of the second autoencoder
[sae2OptTheta, cost] = minFunc(@(p)sparseAutoencoderCost(p, ...
    hiddenSizeL1, hiddenSizeL2, lambda, sparsityParam, beta, sae1Features), sae2Theta, options);
save('saves/step3.mat', 'sae2OptTheta');
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 3: Train the softmax classifier
%  This trains the softmax classifier on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.

[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);
softmaxLambda = 1e-4;
numClasses = 10;
softoptions = struct;
softoptions.maxIter = 400;
softmaxModel = softmaxTrain(hiddenSizeL2, numClasses, softmaxLambda, ...
                            sae2Features, trainLabels, softoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);
save('saves/step4.mat', 'saeSoftmaxOptTheta');

% The input to hidden layer L2 has passed through a nonlinear transformation,
% so showing the weights learned in L2 is more complicated than for L1, and
% I have not figured out how to do it properly. You can display W12*W11 as a
% rough stand-in for the L2 weights when comparing the features learned in
% L1 and L2.
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 5: Finetune softmax model

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.

% Initialize the stack using the parameters learned
stack = cell(2,1);
% sae1OptTheta and sae2OptTheta also contain the weights of each sparse
% autoencoder's reconstruction layer; only the encoder part is used here.
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
% stackedAETheta is a vector holding the parameters of the whole network,
% with the classifier's parameters placed first.
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the
%                dimension of the input to the classifier, which corresponds
%                to "hiddenSizeL2".
%
%
% Fine-tune the parameters of the whole network
[stackedAEOptTheta, cost] = minFunc(@(p)stackedAECost(p, inputSize, hiddenSizeL2, ...
    numClasses, netconfig, lambda, trainData, trainLabels), ...
    stackedAETheta, options);
save('saves/step5.mat', 'stackedAEOptTheta');

if DISPLAY
    optStack = params2stack(stackedAEOptTheta(hiddenSizeL2*numClasses+1:end), netconfig);
    W11 = optStack{1}.w;
    W12 = optStack{2}.w;
    figure;
    title('fine tuning: W11');
    display_network( W11' );
    figure;
    title('fine tuning: W12');
    display_network( (W12*W11)' );
    % TODO(zellyn): figure out how to display a 2-level network
    % display_network(log(1 ./ (1-W11')) * W12');
end
% -------------------------------------------------------------------------

%%======================================================================
%% STEP 6: Test
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels-idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10

[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check
% your code for errors, and make sure you are training on the
% entire data set of 60000 28x28 training images
% (unless you modified the loading code, this should be the case)
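The index arithmetic used to fill stack{d}.w and stack{d}.b relies on how the sparse autoencoder parameters are unrolled. The sketch below assumes the layout [W1(:); W2(:); b1; b2] produced by initializeParameters in the earlier exercises, with W1 of size hidden x visible and W2 of size visible x hidden; the sizes and values are made up so the slices can be checked by eye.

```matlab
% Layout of an unrolled sparse-autoencoder parameter vector:
% [W1(:); W2(:); b1; b2], with W1 hidden x visible and W2 visible x hidden.
visibleSize = 6; hiddenSize = 4;           % tiny made-up sizes
W1 = reshape(1:hiddenSize*visibleSize, hiddenSize, visibleSize);
W2 = -W1';
b1 = 100 + (1:hiddenSize)';
b2 = 200 + (1:visibleSize)';
theta = [W1(:); W2(:); b1; b2];

% Recover the encoder weights and bias with the same slices as the script:
% W1 occupies the first hidden*visible entries, b1 starts after both W1 and W2.
w = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b = theta(2*hiddenSize*visibleSize+1 : 2*hiddenSize*visibleSize+hiddenSize);

disp(isequal(w, W1) && isequal(b, b1));
```

The decoder parameters W2 and b2 are simply skipped, which is why only the encoder half of each autoencoder ends up in the stack.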
stackedAECost.m
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                        numClasses, netconfig, ...
                                        lambda, data, labels)

% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.

% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% netconfig: the network configuration of the stack
% lambda: the weight regularization penalty
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example

%% Unroll softmaxTheta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));

%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

% forward propagation
depth = numel(stack);
z = cell(depth+1, 1);
a = cell(depth+1, 1);
a{1} = data;
for layer = 1:depth
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

% softmax layer; subtracting the column maximum avoids overflow in exp
scores = softmaxTheta * a{depth+1};
scores = bsxfun(@minus, scores, max(scores, [], 1));
p = bsxfun(@rdivide, exp(scores), sum(exp(scores), 1));

numCases = size(data, 2);
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numCases * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;

% back propagation
d = cell(depth+1, 1);
% we take the output of the softmax layer as the output of hidden layer L2
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = depth:-1:2
    d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end

% average over the number of training examples (numCases), not numClasses
for layer = depth:-1:1
    stackgrad{layer}.w = (1/numCases) * d{layer+1} * a{layer}';
    stackgrad{layer}.b = (1/numCases) * sum(d{layer+1}, 2);
end

% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end

% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
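Two small idioms in stackedAECost.m are worth seeing in isolation: subtracting the column maximum before exp, and building the one-hot label matrix with sparse. The sketch below uses made-up scores and labels; the variable naive is only there to show what goes wrong without the shift.

```matlab
% Numerically stable softmax and one-hot labels.
scores = [1000 1; 1001 2; 999 3];          % 3 classes x 2 examples (made up)
labels = [2 3];

% Shifting each column so its largest entry is 0 leaves the ratios unchanged.
shifted = bsxfun(@minus, scores, max(scores, [], 1));
p = bsxfun(@rdivide, exp(shifted), sum(exp(shifted), 1));

% Without the shift, exp(1000) overflows to Inf and Inf/Inf gives NaN.
naive = bsxfun(@rdivide, exp(scores), sum(exp(scores), 1));

% groundTruth(c, i) is 1 when example i has label c, and 0 otherwise.
groundTruth = full(sparse(labels, 1:numel(labels), 1));

disp(p); disp(naive); disp(groundTruth);
```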
stackedAEPredict.m
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)

% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.

% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% data: Our matrix containing the test data as columns. So, data(:,i) is the i-th test example.

% Your code should produce the prediction matrix
% pred, where pred(i) is argmax_c P(y(c) | x(i)).

%% Unroll theta parameter

% We first extract the softmax parameters
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start
%                from 1.

% the input is the first layer; hidden layers L1 and L2 are the second and
% third layer respectively
depth = numel(stack);
z = cell(depth+1, 1);
a = cell(depth+1, 1);
a{1} = data;
for layer = 1:depth
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

% use softmaxTheta to predict: the row with the largest score is the label
[~, pred] = max(softmaxTheta * a{depth+1}, [], 1);

% -----------------------------------------------------------

end

% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
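The prediction step skips the exp and normalization because softmax preserves the ordering of the scores within each column, so the largest raw score already identifies the most probable class. A small sketch with made-up scores and labels, using the same accuracy formula as stackedAEExercise.m:

```matlab
% Turn class scores into labels and measure accuracy.
scores = [0.1 0.7 0.2;      % rows: classes 1..3, columns: examples (made up)
          0.8 0.2 0.1;
          0.1 0.1 0.7];
[~, pred] = max(scores, [], 1);   % row index of the largest score per column

testLabels = [2; 1; 1];           % made-up true labels
acc = mean(testLabels(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);
```

Since the exercise remaps digit 0 to label 10, a predicted label of 10 stands for the digit 0.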
