EX8：异常检测与推荐系统的练习

EX8 异常检测与推荐系统的练习
- - 1.异常检测-Anomaly detection
  - 2.推荐系统

EX8 异常检测与推荐系统的练习

在本练习中，首先将异常检测算法应用于检测网络中的故障服务器。在第二部分中，将使用协作过滤来构建电影推荐系统。

1.异常检测-Anomaly detection

在本节练习中，将实施异常检测算法来检测服务器计算机中的异常行为。该功能测量每个服务器的响应的吞吐量-throughput（mb / s）和延迟-latency（ms）。当服务器正在运行时，共收集了m = 307个例子，说明它们的行为方式，因此有一个未标记的数据集{$x^{(1)},...,x^{(m)}$}。这里绝大多数是正常运行的（非异常）示例，但也可能存在一些在此数据集中异常运行的服务器示例。

我们将使用高斯模型来检测数据集中的异常示例。首先从2D数据集开始，因为它可以更方便地可视化算法正在做什么。在该数据集上，高斯分布的模型是适合的，然后找到具有非常低概率的值，从而可以将其视为异常。之后，我们会将异常检测算法应用于具有许多维度的较大数据集。

首先，可视化数据集如下：

![1.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fi2bey0qi3j20bz09cjrg.jpg)

#### 1.1高斯分布-Gaussian distribution

为了实行异常检测，我们首先需要拟合适合数据集分布的模型，给定训练集{$x^{(1)},...,x^{(m)}$}（其中$x^{(i)}\in \mathbb{R}^n$），要估计每个特征x_i的高斯分布。对于每个特征i = 1,...,n，需要找到拟合第i维数据的参数$\mu_i$和$\sigma_i^2$。高斯分布公式如下：（其中$\mu$为均值，$\sigma^2$表示方差）

![2.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fi2bey9d9lj208h023a9y.jpg)

#### 1.2估计高斯参数

我们可以使用下面的式子来估计参数（$\mu_i,\sigma_i^2$），如为了估计均值有：$\mu_i=\frac{1}{m}\Sigma_{j=1}^mx_i^{(j)}$，为了估计方差有：$\sigma_i^2=\frac{1}{m}\Sigma_{j=1}^m(x_i^{(j)}-\mu_i)^2$。

完成estimateGaussian.m的代码，该函数将数据矩阵X作为输入，并输出保持所有n个特征的平均值的n维向量mu，以及保存所有特征的方差的n维向量sigma2。代码如下：

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
%               In particular, mu(i) should contain the mean of
%               the data for the i-th feature and sigma2(i)
%               should contain variance of the i-th feature.
%
for i=1:n
    mu(i)=1/m*sum(X(:,i));
end

for i=1:n
    for j=1:m
        sigma2(i) = sigma2(i)+(X(j,i)-mu(i))^2;
    end
    sigma2(i) = sigma2(i)/m;
end

运行后如下图，可以看出，大部分的例子都是在最高概率的地区，而异常的例子是在概率较低的地区。

![3.jpg](http://wx4.sinaimg.cn/mw690/7b8d2108gy1fi2beyu8g8j20cg09c3z7.jpg)

#### 1.3选择限值

现在已经估计了高斯参数，可以调查哪些示例在给定此分布的情况下具有非常高的概率，哪些示例的概率非常低，低概率的例子更有可能是我们数据集中的异常现象。确定哪些示例是异常的一种方法是基于交叉验证集来选择阈值。在这部分练习中，将使用交叉验证集上的F1分数来实施一个算法以选择阈值。

完成代码selectThreshold.m。我们将使用交叉验证集{$(x_{cv}^{(1)},y_{cv}^{(1)}),..,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$}，其中标签y = 1对应于异常示例，并且y = 0对应于正常示例。对于每个交叉验证示例，我们将计算$p(x_{cv}^{(i)})$。所有这些概率$p(x_{cv}^{(1)}),p(x_{cv}^{(2)}),...,p(x_{cv}^{(m_{cv})})$以向量的形式被传递给向量pval中，相应的标签$y_{cv}^{(1)},y_{cv}^{(2)},...,y_{cv}^{(m_{cv})}$被传递给向量yval中。

函数selectThreshold.m应该返回两个值; 第一个是所选择的阈值$\varepsilon $ ，如果一个例子x具有低概率$P(x)<\varepsilon $，则被认为是异常。第二个应该返回F1分数，在给定一定的阈值时，通过计算当前阈值正确和不正确地分类样本以计算得到的F1分数，它说明在寻找异常方面做得效果如何。

F_1的计算用到了预测（prec）和召回（rec），具体公式为：$F_1=\frac{2prec*rec}{prec+rec}$，对预测和召回值得计算按照：$prec = \frac{tp}{tp+fp}$和$rec = \frac{tp}{tp+fn}$ 。

tp是真阳的数量(true positive)：标签说这是一个异常同时我们的算法将其也正确地分类为异常。

fp是假阳-误报的数量(false positive)：标签说这不是一个异常，但我们的算法将其分类为异常。

fn是假阴-漏报的数量(false negative)：标签说是一个异常，而我们的算法将其分类为正常。

关于fn和fp的理解

函数selectThreshold.m代码如下：

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
    
    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %               
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions

    cvPredictions = (pval < epsilon);
    tp = sum((cvPredictions == 1) & (yval == 1));
	fp = sum((cvPredictions == 1) & (yval == 0));
    fn = sum((cvPredictions == 0) & (yval == 1));

    prec = tp/(tp+fp);
    rec = tp/(tp+fn);

    F1 = (2*prec*rec)/(prec+rec);
    % =============================================================

    if F1 > bestF1
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

运行后，最佳的epsilon值为8.99e-05，最大的F1值为0.875，选定最佳epsilon后识别异常过的图形如下：

![4.jpg](http://wx3.sinaimg.cn/mw690/7b8d2108gy1fi2bezemh9j20cg09cmy2.jpg)

这里其实还需要补充一点，对于主程序中有以下代码段：

%  通过数据集估计参数
[mu sigma2] = estimateGaussian(X);

%  训练训练集模型
p = multivariateGaussian(X, mu, sigma2);

%  训练交叉验证集模型（用来选择最终的epsilon）
pval = multivariateGaussian(Xval, mu, sigma2);

%  Find the best threshold
[epsilon F1] = selectThreshold(yval, pval);

其中multivariateGaussian函数的定义如下：

function p = multivariateGaussian(X, mu, Sigma2)

n = length(mu);

%Sigma2若为列向量或行向量时，将其变换为对角阵
if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end

X = bsxfun(@minus, X, mu(:)');%  X := X-mu
p = (2 * pi) ^ (- n / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));%det 取行列式  times 乘法

end

给出相应公式作为对照：

![5.jpg](http://wx2.sinaimg.cn/mw690/7b8d2108gy1fhxanuwvhaj20fl01mjrb.jpg?_=7217215)

2.推荐系统

在本部分的练习中，将通过协同过滤学习算法并将其应用于电影评级数据集，该数据集由1到5的等级组成。数据集具有n_u = 943个用户，n_m = 1682部电影。练习的下一部分，将实现计算协同拟合目标函数和渐变的函数cofiCostFunc.m。实现成本函数和渐变后，将使用fmincg.m来学习协作过滤的参数。

2.1电影评分数据集

脚本ex8 cofi.m的第一部分将加载数据集ex8 movies.mat，从而提供变量Y和R。矩阵Y（num movies×num_users尺寸）存储所评分的等级$y^{(i,j)}$（从1到5）。矩阵R是二进制值指示符矩阵，其中如果用户j对电影i给出评级，则$R(i,j)=1$，否则为$R(i,j)=0$。

协同过滤算法的目的是预测用户尚未评估的电影的电影评级，即$R(i,j)=0$的项。从而推荐具有最高预测等级的电影给用户。

为了更好的了解矩阵Y，脚本ex8_cofi.m将计算第一部电影（Toy Story）的平均影片评分，并将平均评分显示到屏幕。在整个练习过程中，将使用矩阵X和Theta：

![6.jpg](http://wx1.sinaimg.cn/mw690/7b8d2108gy1fi2bezrwrnj209m02igli.jpg)

X矩阵的每一行表示第i部电影的特征$x^{(i)}$，Theta矩阵的每一行表示第j个用户的参数向量$\theta^{(j)}$。这里我们设定n=100，即X为n_m*100，Theta为n_u*100。

2.2协同滤波学习算法

协同过滤算法考虑了一组n维参数向量$x^{(1)},...,x^{(n_m)}$和$\theta^{(1)},...,\theta^{(n_u)}$，其中模型将用户j对电影i的评级预测为$y^{(i,j)}=(\theta^{(j)})^Tx^{(i)}$。给定一些由一些用户在某些电影上产生的评分组成的数据集，希望学习参数向量$x^{(1)},...,x^{(n_m)}$和$\theta^{(1)},...,\theta^{(n_u)}$以产生最佳的拟合效果（最小化平方误差）。

函数cofiCostFunc.m中的代码用来计算协同过滤的代价函数和梯度。函数的参数（即尝试学习的值）是X和Theta。为了使用诸如fmincg的现成最小化库，代价函数已被设置为将参数展开为单个向量参数。

2.2.1协同滤波的代价函数

不包含正则化下的代价函数计算公式如下：

![1.jpg](http://wx4.sinaimg.cn/mw690/7b8d2108gy1fi31nftrskj20ey01ut8o.jpg)

函数cofiCostFunc.m 代码如下：

% 主程序中的调用方式
% J = cofiCostFunc([X(:) ; Theta(:)], Y, R, num_users, num_movies, ...
%               num_features, 0);
% Unfold the U and W matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);
                
J = sum(sum(((X*Theta').*R-Y).^2))/2;

2.2.2 协同滤波的梯度

现在应该完成cofiCostFunc.m中的梯度部分代码，该函数能够返回X_grad和Theta_grad两个参数，当然X_grad和Theta_grad得尺寸与X和Theta的尺寸相同。代价函数的梯度计算工式如下：

![2.jpg](http://wx1.sinaimg.cn/mw690/7b8d2108gy1fi31ng5gftj20af044aa9.jpg)

请注意，该函数通过将它们展开为单个向量来返回变量集的梯度。在计算梯度后，脚本ex8_cofi.m将运行梯度校验（checkCostFunction）以数字形式检查梯度的实现。如果实现正确，会发现分析和数值梯度紧密配合。

无正则化的cofiCostFunc.m代码如下：

% Unfold the U and W matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);
            
% You need to return the following values correctly
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));

for i=1:num_movies
    idx = find(R(i,:)==1);
    Theta_temp = Theta(idx,:);
    Y_temp = Y(i,idx);
    X_grad(i,:)= (X(i,:)*Theta_temp'-Y_temp)*Theta_temp;
end

%调试中要搞清楚矩阵Y、R行列表示的具体含义，X、Theta亦然，num_users和num_movies调试值可为4*5
for i=1:num_users
    idx = find(R(:,i)==1);
    X_temp = X(idx,:);
    Y_temp = Y(idx,i);
    Theta_grad(i,:) = ((X_temp*Theta(i,:)'-Y_temp))'*X_temp;
end

梯度校验（checkCostFunction）代码段如下：

% Set lambda
if ~exist('lambda', 'var') || isempty(lambda)
    lambda = 0;
end

%% Create small problem
X_t = rand(4, 3);
Theta_t = rand(5, 3);

% Zap out most entries
Y = X_t * Theta_t';
Y(rand(size(Y)) > 0.5) = 0;
R = zeros(size(Y));
R(Y ~= 0) = 1;

%% Run Gradient Checking
X = randn(size(X_t));
Theta = randn(size(Theta_t));
num_users = size(Y, 2);
num_movies = size(Y, 1);
num_features = size(Theta_t, 2);

numgrad = computeNumericalGradient( ...
                @(t) cofiCostFunc(t, Y, R, num_users, num_movies, ...
                                num_features, lambda), [X(:); Theta(:)]);

[cost, grad] = cofiCostFunc([X(:); Theta(:)],  Y, R, num_users, ...
                          num_movies, num_features, lambda);

disp([numgrad grad]);
fprintf(['The above two columns you get should be very similar.\n' ...
         '(Left-Your Numerical Gradient, Right-Analytical Gradient)\n\n']);

diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf(['If your cost function implementation is correct, then \n' ...
         'the relative difference will be small (less than 1e-9). \n' ...
         '\nRelative Difference: %g\n'], diff);

%数值上计算梯度函数computeNumericalGradient代码段如下:
function numgrad = computeNumericalGradient(J, theta)

numgrad = zeros(size(theta));
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
    % Set perturbation vector
    perturb(p) = e;
    loss1 = J(theta - perturb);
    loss2 = J(theta + perturb);
    % Compute Numerical Gradient
    numgrad(p) = (loss2 - loss1) / (2*e);
    perturb(p) = 0;
end

end

2.2.3正则化代价函数

含有正则化参数的协同滤波代价函数公式如下：

![3.jpg](http://wx1.sinaimg.cn/mw690/7b8d2108gy1fi31ngnmn2j20hd045mxd.jpg)

完成后的cofiCostFunc.m代价函数部分代码为：

J = sum(sum(((X*Theta').*R-Y).^2))/2;
J = J + sum(sum(Theta.^2))*lambda/2 + sum(sum(X.^2))*lambda/2

2.2.4正则化梯度下降

对含有正则化代价函数的梯度运算公式为：

![4.jpg](http://wx4.sinaimg.cn/mw690/7b8d2108gy1fi31nh1k3pj20ce046mxf.jpg)

完成后的cofiCostFunc.m梯度下降部分完整代码为：

for i=1:num_movies
    idx = find(R(i,:)==1);
    Theta_temp = Theta(idx,:);
    Y_temp = Y(i,idx);
    X_grad(i,:)= (X(i,:)*Theta_temp'-Y_temp)*Theta_temp;
    X_grad(i,:) = X_grad(i,:)+lambda*X(i,:);
end

for j=1:num_users
    idx = find(R(:,j)==1);
    X_temp = X(idx,:);
    Y_temp = Y(idx,j);
    Theta_grad(j,:) = ((X_temp*Theta(j,:)'-Y_temp))'*X_temp;
    Theta_grad(j,:) = Theta_grad(j,:) + lambda*Theta(j,:);
end
% =============================================================

grad = [X_grad(:); Theta_grad(:)];

2.3 学习推荐电影

在ex8 cofi.m脚本的最后一部分，可以输入自己喜爱的电影，便于算法运行时，可以获得系统推荐给自己的电影！所有电影的列表及其在数据集中的编号可以在文件movie idx.txt中找到。

movie idx.txt部分内容如下：

1 Toy Story (1995)
2 GoldenEye (1995)
3 Four Rooms (1995)
4 Get Shorty (1995)
5 Copycat (1995)
6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
7 Twelve Monkeys (1995)
8 Babe (1995)
9 Dead Man Walking (1995)
10 Richard III (1995)
11 Seven (Se7en) (1995)
12 Usual Suspects, The (1995)
13 Mighty Aphrodite (1995)
14 Postino, Il (1994)
15 Mr. Holland's Opus (1995)

读取该内容的完整程序为：

% movieList = loadMovieList();

function movieList = loadMovieList()
%	GETMOVIELIST reads the fixed movie list in movie.txt and returns a
%	cell array of the words
%   movieList = GETMOVIELIST() reads the fixed movie list in movie.txt 
%   and returns a cell array of the words in movieList.

%% Read the fixed movieulary list
fid = fopen('movie_ids.txt');

% Store all movies in cell array movie{}
n = 1682;  % Total number of movies 

movieList = cell(n, 1);
for i = 1:n
    % Read line
    line = fgets(fid);%fgets函数用于读取文件中指定一行，并保留换行符，与之前的fopen配套使用
    % Word Index (can ignore since it will be = i)
    [idx, movieName] = strtok(line, ' ');%按“ ”分类捕捉索引和字符串电影名变量
    % Actual Word
    movieList{i} = strtrim(movieName);%此函数返回字符串str的副本并删除了所有前导和尾随空格字符
end
fclose(fid);

end

在额外的额定值被添加到数据集之后，脚本将继续训练协同过滤模型，具体添加过程如下：

%  初始化个人观影评分
my_ratings = zeros(1682, 1);

% Check the file movie_idx.txt for id of each movie in our dataset
% For example, Toy Story (1995) has ID 1, so to rate it "4", you can set
my_ratings(1) = 4;

% Or suppose did not enjoy Silence of the Lambs (1991), you can set
my_ratings(98) = 2;

% We have selected a few movies we liked / did not like and the ratings we
% gave are as follows:
my_ratings(7) = 3;
my_ratings(12)= 5;
my_ratings(54) = 4;
my_ratings(64)= 5;
my_ratings(66)= 3;
my_ratings(69) = 5;
my_ratings(183) = 4;
my_ratings(226) = 5;
my_ratings(355)= 5;

此后，算法将同过调用训练集学习参数X和Theta，从而预测电影i对用户j的评级$(\theta^{(j)})^Tx^{(i)}$。脚本的下一部分计算所有电影和用户的评分，根据脚本中上面输入的评分，进而显示它推荐的电影。

Top recommendations for you:
Predicting rating 5.0 for movie Star Kid (1997)
Predicting rating 5.0 for movie Prefontaine (1997)
Predicting rating 5.0 for movie Great Day in Harlem, A (1994)
Predicting rating 5.0 for movie Someone Else's America (1995)
Predicting rating 5.0 for movie Marlene Dietrich: Shadow and Light (1996)
Predicting rating 5.0 for movie Aiqing wansui (1994)
Predicting rating 5.0 for movie Entertaining Angels: The Dorothy Day Story (1996)
Predicting rating 5.0 for movie Santa with Muscles (1996)
Predicting rating 5.0 for movie They Made Me a Criminal (1939)
Predicting rating 5.0 for movie Saint of Fort Washington, The (1993)

posted @ 2017-07-30 23:16 SrtFrmGNU 阅读(628) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

Luke's Blog.

欲穷千里目，事后拂衣归