Bayes classification with Mahout 0.6

Parameters of the mahout trainclassifier command:

Parameters of the mahout testclassifier command:
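A rough summary of the options that actually appear in the commands below (not an exhaustive list; running either command without arguments prints the full usage):

trainclassifier: -i training input directory, -o output directory for the model, -type classifier type (bayes or cbayes), -ng n-gram size, -source data source (hdfs or hbase)
testclassifier: -m model directory produced by trainclassifier, -d test data directory, -type / -ng / -source as above, -method execution mode (mapreduce or sequential)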

Steps from my own hands-on run

The documents need a format conversion first. Using the 20news-bydate.tar.gz dataset as an example, download and extract it:

http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
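For example (the target directory /usr/local/20news-bydate is an assumption chosen to match the paths used in the rest of this walkthrough; adjust to your environment):

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
mkdir -p /usr/local/20news-bydate
tar -xzf 20news-bydate.tar.gz -C /usr/local/20news-bydate

After extraction you should have 20news-bydate-train and 20news-bydate-test under /usr/local/20news-bydate.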

1. Preprocess the train input data:

 

[root@mlj ~]# mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /usr/local/20news-bydate/20news-bydate-train -o /usr/local/20news-bydate/bayesoutput/train -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

 

This converts the data into the format the Bayes M/R job expects as input: the trainer reads KeyValueTextInputFormat data, where the first field on each line is the class label and the remaining fields are the feature attributes (i.e. the words).
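As a rough illustration (the words here are made up, and the exact tokens depend on the analyzer), each line of the prepared files is a label, a tab, and then the document's terms:

alt.atheism	atheism god religion argument evidence ...

You can spot-check this with head -n 1 on one of the files under /usr/local/20news-bydate/bayesoutput/train.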

Taking the 20 newsgroups example: the raw data downloaded from the site is a directory of categories, where each subdirectory name is a class label and the files inside it are the documents belonging to that class (one file per article).

Mahout preprocesses the data with org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups: it walks every file under the original category directories in turn and uses each directory name as the category label, which is how the inputDir --> outputFile conversion is done. To handle HTML files, BayesFileFormatter would additionally have to clean the HTML and extract the body text to produce cleaned text.
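A minimal shell sketch of that transformation, assuming a crude lowercase/strip-punctuation tokenizer is good enough for illustration (the real tool tokenizes with the Lucene analyzer passed via -a, and its output layout may differ):

# walk every category directory; the directory name becomes the class label
for dir in /usr/local/20news-bydate/20news-bydate-train/*/; do
  label=$(basename "$dir")
  for f in "$dir"*; do
    # crude tokenization: lowercase, drop punctuation, collapse the document onto one line
    text=$(tr 'A-Z' 'a-z' < "$f" | tr -c 'a-z0-9' ' ' | tr -s ' ')
    printf '%s\t%s\n' "$label" "$text"   # label<TAB>terms, one line per document
  done
done > /tmp/train-prepared.txt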

2. Preprocess the test input data:

[root@mlj ~]# mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /usr/local/20news-bydate/20news-bydate-test -o /usr/local/20news-bydate/bayesoutput/test -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3. Upload the processed files to HDFS

[root@mlj bayesoutput]# hadoop fs -put train /naivebayes/
[root@mlj bayesoutput]# hadoop fs -put test /naivebayes/
[root@mlj bayesoutput]# hadoop fs -ls /naivebayes
Found 2 items
drwxr-xr-x - root supergroup 0 2015-04-23 01:12 /naivebayes/test
drwxr-xr-x - root supergroup 0 2015-04-23 01:12 /naivebayes/train

4. Train the classification model

[root@mlj bayesoutput]# mahout trainclassifier -i /naivebayes/train -o /naivebayes/model -type bayes -ng 3 -source hdfs
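Once the job finishes, the model files should appear under the output path; a quick sanity check:

hadoop fs -ls /naivebayes/model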

5. Test the trained model

 

[root@mlj bayesoutput]# mahout testclassifier -m /naivebayes/model -d /naivebayes/test/ -type bayes -ng 3 -source hdfs -method mapreduce

 

 

(Using a custom input format):

1. Split the dataset 1:4, with 1 part as the test set and 4 parts as the training set. In Pig this is done by taking a 20% random sample as the test set, then left-outer-joining the full set against that sample and keeping the unmatched rows as the training set (i.e. the complement):

 

grunt> processed = load 'hdfs://mlj:9000/out/processed' as (category:chararray,doc:chararray);
grunt> test = sample processed 0.2;
grunt> jnt = join processed by (category,doc) left outer, test by (category,doc);
grunt> filt_test = filter jnt by test::category is null;
grunt> train = foreach filt_test generate processed::category as category,processed::doc as doc;
grunt> store test into 'out/test';
grunt> store train into 'out/train';
grunt> train_ct = foreach (group train by category) generate group,COUNT(train.category);
grunt> test_ct = foreach (group test by category) generate group,COUNT(test.category);
grunt> dump train_ct;
grunt> dump test_ct;

2. Train the model:

mahout trainclassifier -i /nb/train -o /nb/model -type bayes -ng 1 -source hdfs

3. Test the model:

mahout testclassifier -d /nb/test -m /nb/model -type bayes -ng 1 -source hdfs -method mapreduce

----------------------------------------------------------------------------------------------------------

An alternative approach:

Serialize:  mahout seqdirectory -i /naive/news -o /naive/news-seq
Vectorize:  mahout seq2sparse -i /naive/news-seq -o /naive/news-vector -lnorm -nv -wt tfidf
Split:      mahout split -i /naive/news-vector/tfidf-vectors -tr /naive/news-train-vector -te /naive/news-test-vector -rp 20 -ow -seq -xm sequential
Train:      mahout trainnb -i /naive/news-train-vector -el -o /naive/news-model -li /naive/news-labindex -ow -c
Test:       mahout testnb -i /naive/news-test-vector -m /naive/news-model -l /naive/news-labindex -ow -o /naive/news-test -c
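To spot-check the intermediate SequenceFiles, for example the label index written by trainnb, Mahout's seqdumper utility can print their contents (paths as above; the exact output format varies by version):

mahout seqdumper -i /naive/news-labindex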
