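Got up a little after seven, had breakfast at the cafeteria, topped up my meal card, went back to the dorm to pay the internet fee, and then headed to the library.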
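First, import the data. Octave is tight on memory, so I could only import roughly 20,000 rows at a time. I therefore split the whole training set into 6 parts: the first five hold 20,000 rows each and the last one holds the remaining 5,471. (dlmread ranges are zero-based, so starting at row 1 and column 1 skips the header row and the first column.)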
octave> train_1 = dlmread("train_v2.csv", ",", [1, 1, 20000, 778]);
octave> save("train_1.mat", "train_1")
octave> train_2 = dlmread("train_v2.csv", ",", [20001, 1, 40000, 778]);
octave> save("train_2.mat", "train_2")
octave> train_3 = dlmread("train_v2.csv", ",", [40001, 1, 60000, 778]);
octave> save("train_3.mat", "train_3")
octave> train_4 = dlmread("train_v2.csv", ",", [60001, 1, 80000, 778]);
octave> save("train_4.mat", "train_4")
octave> train_5 = dlmread("train_v2.csv", ",", [80001, 1, 100000, 778]);
octave> save("train_5.mat", "train_5")
octave> train_6 = dlmread("train_v2.csv", ",", [100001, 1, 105471, 778]);
octave> save("train_6.mat", "train_6")
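The six pairs of commands above follow one pattern, so they could probably be driven by a loop. The following is only a sketch under assumptions: `chunk_row_ranges` and `split_csv` are hypothetical helper names that are not part of the original session, and the row count has to be passed in by hand.

```octave
1;  % script marker: lets the functions below be defined in a script file

% Return a k-by-2 matrix of [first_row, last_row] pairs that cover
% rows 1..n_rows in chunks of at most chunk_size rows.
function ranges = chunk_row_ranges(n_rows, chunk_size)
  first = (1:chunk_size:n_rows)';
  last = min(first + chunk_size - 1, n_rows);
  ranges = [first, last];
end

% Read csv_file chunk by chunk and save chunk k as variable train_k
% in train_k.mat. Row 0 and column 0 are skipped, as in the commands
% above (dlmread ranges are zero-based).
function split_csv(csv_file, n_rows, chunk_size, last_col)
  ranges = chunk_row_ranges(n_rows, chunk_size);
  for k = 1:size(ranges, 1)
    name = sprintf("train_%d", k);
    s = struct();
    s.(name) = dlmread(csv_file, ",", [ranges(k, 1), 1, ranges(k, 2), last_col]);
    save([name ".mat"], "-struct", "s");  % stores the field as variable train_k
    clear s;                               % release the chunk before the next read
  end
end
```

With these helpers, `split_csv("train_v2.csv", 105471, 20000, 778)` should issue the same six reads as the manual commands.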
Here is what all the data looks like now:
octave> clear all; load("train_1.mat"); size(train_1)
ans =
20000 770
octave> clear all; load("train_2.mat"); size(train_2)
ans =
20000 770
octave> clear all; load("train_3.mat"); size(train_3)
ans =
20000 770
octave> clear all; load("train_4.mat"); size(train_4)
ans =
20000 770
octave> clear all; load("train_5.mat"); size(train_5)
ans =
20000 770
octave> clear all; load("train_6.mat"); size(train_6)
ans =
5471 770
Note: the last column of each matrix is the label (the answer), not a feature.
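Since the last column is the answer, a chunk can be separated into features and answer before any further processing. A minimal sketch, assuming a hypothetical helper name `splitFeaturesAndAnswer` that does not appear in the original session:

```octave
1;  % script marker: lets the function below be defined in a script file

% Split a loaded chunk into a feature matrix X and an answer column y.
function [X, y] = splitFeaturesAndAnswer(data)
  X = data(:, 1:end-1);   % every column except the last
  y = data(:, end);       % last column: the answer
end
```

For a 20000 x 770 chunk such as train_1, `[X_1, y_1] = splitFeaturesAndAnswer(train_1);` would give a 20000 x 769 feature matrix and a 20000 x 1 answer column.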
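This took a huge amount of time; the whole morning went into it. Then I saw that test_v2.csv is over 1 GB.
In the afternoon I wrote a function, answerToLabel, that converts every answer into a positive/negative example label: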
function label = answerToLabel(answer)
  % Label is 1 (positive example) when the answer is greater than 0, otherwise 0.
  label = zeros(size(answer));
  for i = 1:length(answer)
    if answer(i) > 0
      label(i) = true;
    else
      label(i) = false;
    end
  end
end
octave> clear all; load("train_1.mat"); label_1 = answerToLabel(train_1(:, end)); save("train_1.mat", "train_1", "label_1");
octave> clear all; load("train_2.mat"); label_2 = answerToLabel(train_2(:, end)); save("train_2.mat", "train_2", "label_2");
octave> clear all; load("train_3.mat"); label_3 = answerToLabel(train_3(:, end)); save("train_3.mat", "train_3", "label_3");
octave> clear all; load("train_4.mat"); label_4 = answerToLabel(train_4(:, end)); save("train_4.mat", "train_4", "label_4");
octave> clear all; load("train_5.mat"); label_5 = answerToLabel(train_5(:, end)); save("train_5.mat", "train_5", "label_5");
octave> clear all; load("train_6.mat"); label_6 = answerToLabel(train_6(:, end)); save("train_6.mat", "train_6", "label_6");
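The loop in answerToLabel can likely be replaced by a single comparison, which tends to be faster in Octave on 20,000-element columns. This is a sketch of an equivalent; the name `answerToLabelVec` is hypothetical and not part of the original code.

```octave
1;  % script marker: lets the function below be defined in a script file

% Vectorized equivalent of answerToLabel: 1 where answer > 0, otherwise 0.
function label = answerToLabelVec(answer)
  label = double(answer > 0);   % double() matches the type produced by zeros()
end
```

It can be dropped into the commands above in place of answerToLabel, for example `label_1 = answerToLabelVec(train_1(:, end));`.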
Then I normalized all the features with the feature normalization function from Coursera:
function [X_norm, mu, sigma] = featureNormalize(X)
  % Subtract each column's mean, then divide by its standard deviation.
  mu = mean(X);
  X_norm = bsxfun(@minus, X, mu);
  sigma = std(X_norm);
  X_norm = bsxfun(@rdivide, X_norm, sigma);
end
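One thing to watch with featureNormalize: a column whose values are all identical has a standard deviation of 0, so the division produces NaN. The data is also spread over six files, so the mu and sigma computed on one matrix may need to be reused on another. The sketch below shows one possible way to handle both; `featureNormalizeSafe` and `applyNormalization` are hypothetical names, and replacing a zero sigma with 1 is an assumption, not something the original function does.

```octave
1;  % script marker: lets the functions below be defined in a script file

% Variant of featureNormalize that maps constant columns to 0 instead of NaN.
function [X_norm, mu, sigma] = featureNormalizeSafe(X)
  mu = mean(X, 1);                 % column means
  sigma = std(X, 0, 1);            % column standard deviations (n-1 normalization)
  sigma(sigma == 0) = 1;           % assumption: constant columns are divided by 1
  X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
end

% Apply an already computed mu and sigma to another matrix.
function X_norm = applyNormalization(X, mu, sigma)
  X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
end
```

Note that only the feature columns should go through this, for example `train_1(:, 1:end-1)`, since the last column is the answer.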