2014.2.13 - Loan Prediction, Day 2

Posted on 2014-02-18 16:57  SnakeHunt2012

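I got up a little after seven, had breakfast at the cafeteria, topped up my meal card, went back to the dorm to pay the internet bill, and then came to the library.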

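First, importing the data. Octave is tight on memory, so I can only import about 20,000 rows at a time. I therefore split the whole training set into 6 parts: five parts of 20,000 rows each and a final part of 5,471 rows.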

octave> train_1 = dlmread("train_v2.csv", ",", [1, 1, 20000, 778]);
octave> save("train_1.mat", "train_1")
octave> train_2 = dlmread("train_v2.csv", ",", [20001, 1, 40000, 778]);
octave> save("train_2.mat", "train_2")
octave> train_3 = dlmread("train_v2.csv", ",", [40001, 1, 60000, 778]);
octave> save("train_3.mat", "train_3")
octave> train_4 = dlmread("train_v2.csv", ",", [60001, 1, 80000, 778]);
octave> save("train_4.mat", "train_4")
octave> train_5 = dlmread("train_v2.csv", ",", [80001, 1, 100000, 778]);
octave> save("train_5.mat", "train_5")
octave> train_6 = dlmread("train_v2.csv", ",", [100001, 1, 105471, 778]);
octave> save("train_6.mat", "train_6")
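
The six import commands above follow one pattern, so the ranges could also be generated in a loop. The following is only a sketch, not what I actually ran: the variable names n_rows, chunk_size, last_col, ranges and s are made up for illustration, and the row count and column index are simply copied from the commands above. It assumes that save accepts the "-struct" option in function form, so that each file still holds a variable named train_1, train_2, and so on.

```octave
% Sketch: build the zero-based dlmread ranges [r0, c0, r1, c1] in a loop.
n_rows     = 105471;   % last data row, copied from the commands above
chunk_size = 20000;    % rows per chunk that fit in memory
last_col   = 778;      % last column index used in the commands above

n_chunks = ceil(n_rows / chunk_size);
ranges = zeros(n_chunks, 4);
for k = 1:n_chunks
  r0 = (k - 1) * chunk_size + 1;        % starts at row 1, so row 0 is skipped
  r1 = min(k * chunk_size, n_rows);     % the last chunk is shorter
  ranges(k, :) = [r0, 1, r1, last_col]; % starts at column 1, so column 0 is skipped
end

% Only touch the disk when the CSV is actually present.
if exist("train_v2.csv", "file")
  for k = 1:n_chunks
    name = sprintf("train_%d", k);
    s = struct();
    s.(name) = dlmread("train_v2.csv", ",", ranges(k, :));
    save([name ".mat"], "-struct", "s");  % stored under the variable name train_k
    clear s;                              % free the chunk before reading the next one
  end
end
```

Clearing the temporary struct after each save keeps only one chunk in memory at a time, which is the whole point of splitting the file.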

Here is what all the data looks like now:

octave> clear all; load("train_1.mat"); size(train_1)
ans =

   20000     770

octave> clear all; load("train_2.mat"); size(train_2)
ans =

   20000     770

octave> clear all; load("train_3.mat"); size(train_3)
ans =

   20000     770

octave> clear all; load("train_4.mat"); size(train_4)
ans =

   20000     770

octave> clear all; load("train_5.mat"); size(train_5)
ans =

   20000     770

octave> clear all; load("train_6.mat"); size(train_6)
ans =

   5471    770

Note: the last column of each of these matrices is the label (the answer), not a feature.
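
Because the answer shares a matrix with the features, any later step that should see only features has to drop the last column first. A minimal sketch on a made-up 3x3 matrix (chunk, X and y are illustrative names, and the numbers are not from the data set):

```octave
% Toy stand-in for one chunk: two feature columns followed by the answer column.
chunk = [0.5 1.2 0; 0.1 3.4 12; 0.9 2.2 0];

X = chunk(:, 1:end-1);   % features: everything except the last column
y = chunk(:, end);       % answers: the last column only
```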

This took a lot of time; the whole morning went into it. Then I noticed that test_v2.csv is over 1 GB.

In the afternoon I wrote a function, answerToLabel, that converts every answer into a positive/negative example label:

function label = answerToLabel(answer)
  % Map each answer to a 0/1 label: 1 (positive) if answer > 0, else 0 (negative).
  label = zeros(size(answer));
  for i = 1:numel(answer)
    if answer(i) > 0
      label(i) = true;
    else
      label(i) = false;
    end
  end
end
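
The same mapping can probably be written without a loop, since a comparison in Octave already works element by element. This is a sketch of an equivalent, not the function I used; the answer values below are made up.

```octave
answer = [0; 3; 0; 17; 1];        % toy answers, not taken from the data set
label  = double(answer > 0);      % 1 where the answer is positive, 0 otherwise
```

The double() call keeps the result numeric, matching the zeros() array that answerToLabel fills in; without it the result would be a logical array.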


octave> clear all; load("train_1.mat"); label_1 = answerToLabel(train_1(:, end)); save("train_1.mat", "train_1", "label_1");
octave> clear all; load("train_2.mat"); label_2 = answerToLabel(train_2(:, end)); save("train_2.mat", "train_2", "label_2");
octave> clear all; load("train_3.mat"); label_3 = answerToLabel(train_3(:, end)); save("train_3.mat", "train_3", "label_3");
octave> clear all; load("train_4.mat"); label_4 = answerToLabel(train_4(:, end)); save("train_4.mat", "train_4", "label_4");
octave> clear all; load("train_5.mat"); label_5 = answerToLabel(train_5(:, end)); save("train_5.mat", "train_5", "label_5");
octave> clear all; load("train_6.mat"); label_6 = answerToLabel(train_6(:, end)); save("train_6.mat", "train_6", "label_6");


Then I used the normalization function from the Coursera course to normalize all the features:

function [X_norm, mu, sigma] = featureNormalize(X)

  mu = mean(X);
  X_norm = bsxfun(@minus, X, mu);

  sigma = std(X_norm);
  X_norm = bsxfun(@rdivide, X_norm, sigma);

end
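
Two things are worth keeping in mind when applying this function to the chunks: the last column is the answer and should not be normalized, and a feature column that is constant has a standard deviation of zero, so the division would produce NaN. The sketch below inlines the same computation on a made-up matrix. Replacing a zero sigma with 1 is my own assumption for illustration, not part of the Coursera function.

```octave
% Toy stand-in for one chunk: three feature columns plus the answer column.
data = [1 10 5 0; 2 20 5 3; 3 30 5 0; 4 40 5 7];

X = data(:, 1:end-1);            % normalize the features only, not the answer
mu = mean(X);
sigma = std(X);
sigma(sigma == 0) = 1;           % assumption: keep constant columns at 0 instead of NaN

X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
```

If the chunks are normalized one at a time, each chunk ends up with its own mu and sigma, so it may be safer to compute them once and reuse the same values for every chunk and for the test set.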