Softmax Regression Review

Preface:

Today I revisited another softmax regression tutorial from Andrew Ng's UFLDL materials. The main content is essentially the same as the previous tutorial, but the programming implementation differs in a few places; the code for the two parts was probably written by different people. I originally just wanted to consolidate what I had learned, but comparing the two implementations also turned up a few issues.

 

Theory:

For convenience, the approach from the earlier tutorial is referred to below as Method 1, and the approach in this section as Method 2.

The parameters of softmax regression have a notable property: redundancy. If we subtract the same vector ψ from every class's parameter vector θ_j of a learned model, the predicted probabilities do not change:

$$P(y^{(i)}=k \mid x^{(i)};\theta) = \frac{\exp\big((\theta_k-\psi)^\top x^{(i)}\big)}{\sum_{j=1}^{K}\exp\big((\theta_j-\psi)^\top x^{(i)}\big)} = \frac{\exp(\theta_k^\top x^{(i)})}{\sum_{j=1}^{K}\exp(\theta_j^\top x^{(i)})}$$

In Method 1, the author resolves this by adding a regularization (weight decay) term to the objective function:

$$J(\theta) = -\frac{1}{m}\Bigg[\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y^{(i)}=k\}\log\frac{\exp(\theta_k^\top x^{(i)})}{\sum_{j=1}^{K}\exp(\theta_j^\top x^{(i)})}\Bigg] + \frac{\lambda}{2}\sum_{k=1}^{K}\sum_{j=0}^{n}\theta_{kj}^2$$

In Method 2, the author instead handles the redundancy in the code by fixing the parameter vector of the last class to zero (θ_K = 0), without adding a regularization term. Either choice removes the redundancy; a minimal check of the invariance itself is sketched below.
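A small MATLAB check of the invariance (the sizes and random data are made up purely for illustration):

% Redundancy check: subtracting the same vector psi from every column of
% theta leaves the softmax probabilities unchanged.
n = 4; K = 3;                         % toy sizes: 4 features, 3 classes
theta = randn(n, K);                  % one parameter column per class
psi   = randn(n, 1);                  % arbitrary shift vector
x     = randn(n, 1);                  % a single example

p1 = exp(theta' * x);                 p1 = p1 / sum(p1);
theta_shifted = bsxfun(@minus, theta, psi);
p2 = exp(theta_shifted' * x);         p2 = p2 / sum(p2);

disp(max(abs(p1 - p2)));              % ~0 up to floating-point error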

 

Experimental results:

Method 1: Accuracy: 90.300%

Method 2: Accuracy: 87.5%

Method 2 + regularization: Accuracy: 89.5%

 

Analysis: As the results show, adding the regularization term is somewhat more effective. Regularization is also the approach commonly used for this kind of problem in machine learning, e.g. in SVMs and sparse coding. With a regularization term, the weight decay factor λ controls the relative importance of the data-fitting term and the penalty term, which is more flexible; simply fixing θ_K = 0, by contrast, is rather crude. A sketch of sweeping λ follows.
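A minimal sketch of how one might sweep the weight decay factor λ to see this trade-off. It reuses the variables n, num_classes, options, and train from the listing further below, and assumes a hypothetical variant of softmax_regression_vec that accepts lambda as an extra argument (the listing below hard-codes lambda instead); minFunc forwards any extra arguments to the objective.

% Hypothetical lambda sweep: retrain with several weight decay values and
% compare training accuracy.
lambdas = [0, 1e-5, 1e-4, 1e-3];
for i = 1:numel(lambdas)
    theta0 = rand(n, num_classes-1) * 0.001;
    theta  = minFunc(@softmax_regression_vec, theta0(:), options, ...
                     train.X, train.y, lambdas(i));
    theta  = [reshape(theta, n, []), zeros(n,1)];   % re-attach the fixed zero column
    acc    = multi_classifier_accuracy(theta, train.X, train.y);
    fprintf('lambda = %g: training accuracy %.1f%%\n', lambdas(i), 100*acc);
end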

 

Main experiment code:

The code for Method 1 can be found in the author's earlier post (see the references below).

Method 2 code:

  Starter code: http://ufldl.stanford.edu/tutorial/StarterCode/

  ex1c_softmax.m

addpath ../common
addpath ../common/minFunc_2012/minFunc
addpath ../common/minFunc_2012/minFunc/compiled

% Load the MNIST data for this exercise.
% train.X and test.X will contain the training and testing images.
%   Each matrix has size [n,m] where:
%      m is the number of examples.
%      n is the number of pixels in each image.
% train.y and test.y will contain the corresponding labels (0 to 9).
binary_digits = false;
num_classes = 10;
[train,test] = ex1_load_mnist(binary_digits);

% Add row of 1s to the dataset to act as an intercept term.
train.X = [ones(1,size(train.X,2)); train.X]; 
test.X = [ones(1,size(test.X,2)); test.X];
train.y = train.y+1; % make labels 1-based.
test.y = test.y+1; % make labels 1-based.

% Training set info
m=size(train.X,2);
n=size(train.X,1);

% Train softmax classifier using minFunc
options = struct('MaxIter', 200);

% Initialize theta.  We use a matrix where each column corresponds to a class,
% and each row is a classifier coefficient for that class.
% Inside minFunc, theta will be stretched out into a long vector (theta(:)).
% We only use num_classes-1 columns, since the last column is always assumed 0.
theta = rand(n,num_classes-1)*0.001;

% Call minFunc with the softmax_regression_vec.m file as objective.
%
% TODO:  Implement batch softmax regression in the softmax_regression_vec.m
% file using a vectorized implementation.
%
tic;
theta(:)=minFunc(@softmax_regression_vec, theta(:), options, train.X, train.y);
fprintf('Optimization took %f seconds.\n', toc);
theta=[theta, zeros(n,1)]; % expand theta to include the last class.

% Print out training accuracy.
tic;
accuracy = multi_classifier_accuracy(theta,train.X,train.y);
fprintf('Training accuracy: %2.1f%%\n', 100*accuracy);

% Print out test accuracy.
accuracy = multi_classifier_accuracy(theta,test.X,test.y);
fprintf('Test accuracy: %2.1f%%\n', 100*accuracy);


% % for learning curves
% global test
% global train
% test.err{end+1} = multi_classifier_accuracy(theta,test.X,test.y);
% train.err{end+1} = multi_classifier_accuracy(theta,train.X,train.y);

 

softmax_regression_vec.m

function [f,g] = softmax_regression_vec(theta, X, y)
  %
  % Arguments:
  %   theta - A vector containing the parameter values to optimize.
  %       In minFunc, theta is reshaped to a long vector.  So we need to
  %       resize it to an n-by-(num_classes-1) matrix.
  %       Recall that we assume theta(:,num_classes) = 0.
  %
  %   X - The examples stored in a matrix.  
  %       X(i,j) is the i'th coordinate of the j'th example.
  %   y - The label for each example.  y(j) is the j'th example's label.
  %
  m=size(X,2);
  n=size(X,1);
    
  % theta is passed in as a vector; reshape it to n x (num_classes-1).
  % The column for the last class is implicitly all zeros.
  theta=reshape(theta, n, []);
  num_classes=size(theta,2)+1;
  
  % initialize objective value and gradient.
  f = 0;
  g = zeros(size(theta));

  %
  % TODO:  Compute the softmax objective function and gradient using vectorized code.
  %        Store the objective function value in 'f', and the gradient in 'g'.
  %        Before returning g, make sure you form it back into a vector with g=g(:);
  %
%%% YOUR CODE HERE %%%
  
lambda = 1e-4;  % weight decay factor (only used if the regularized lines below are enabled)

theta = [ theta zeros(n,1) ];               % append the fixed all-zero column for the last class
h = exp( theta' * X );                      % unnormalized class scores, num_classes x m
p = bsxfun(@rdivide, h, sum(h, 1));         % normalize each column to get class probabilities
log_p = log( p );
gt = full( sparse( y, 1:m, 1) );            % one-hot ground-truth matrix, num_classes x m

% with weight decay term
% f = -1/m * gt(:)' * log_p(:) + lambda/2 * sum( theta(:) .^ 2 );
% g = -1/m * X * (gt - p)' + lambda * theta;

% without weight decay term
f = -1/m * gt(:)' * log_p(:);
g = -1/m * X * (gt - p)';
g = g( :, 1:end-1 );                        % drop the gradient of the fixed last column

  g=g(:); % make gradient a vector for minFunc
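One caveat the listing above does not address: computing exp(theta' * X) directly can overflow when the class scores are large. A common remedy, sketched below under the same variable conventions as the function above, is to subtract the per-column maximum before exponentiating; the shift cancels in the normalization, so p is unchanged.

% Numerically more stable way to compute the probabilities p above:
scores = theta' * X;                                   % num_classes x m class scores
scores = bsxfun(@minus, scores, max(scores, [], 1));   % shift each column by its max
h = exp( scores );                                     % now bounded above by 1
p = bsxfun(@rdivide, h, sum(h, 1));                    % same p as before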

 

References:

http://ufldl.stanford.edu/tutorial/StarterCode/

http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/

http://deeplearning.stanford.edu/wiki/index.php/Softmax_Regression

http://www.cnblogs.com/dupuleng/articles/4118178.html

http://www.cnblogs.com/dupuleng/articles/4118387.html

 
