Bayes classification with Mahout 0.6

Parameters of the mahout trainclassifier command:

Parameters of the mahout testclassifier command:
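A rough summary of the options that actually appear in the commands below (not an exhaustive list; running either command without arguments prints the full usage):

trainclassifier: -i training input directory, -o output directory for the model, -type classifier type (bayes or cbayes), -ng n-gram size, -source data source (hdfs or hbase)
testclassifier: -m model directory produced by trainclassifier, -d test data directory, -type / -ng / -source as above, -method execution mode (mapreduce or sequential)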

Steps from my own hands-on run

The documents need a format conversion first. Using the 20news-bydate.tar.gz dataset as an example, download and extract it:

http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
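For example (the target directory /usr/local/20news-bydate is an assumption chosen to match the paths used in the rest of this walkthrough; adjust to your environment):

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
mkdir -p /usr/local/20news-bydate
tar -xzf 20news-bydate.tar.gz -C /usr/local/20news-bydate

After extraction you should have 20news-bydate-train and 20news-bydate-test under /usr/local/20news-bydate.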

1. Preprocess the train input data:

 

[root@mlj ~]# mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /usr/local/20news-bydate/20news-bydate-train -o /usr/local/20news-bydate/bayesoutput/train -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

 

This converts the data into the format the Bayes M/R job expects as input: the trainer reads KeyValueTextInputFormat data, where the first field on each line is the class label and the remaining fields are the feature attributes (i.e. the words).
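As a rough illustration (the words here are made up, and the exact tokens depend on the analyzer), each line of the prepared files is a label, a tab, and then the document's terms:

alt.atheism	atheism god religion argument evidence ...

You can spot-check this with head -n 1 on one of the files under /usr/local/20news-bydate/bayesoutput/train.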

Taking the 20 newsgroups example: the raw data downloaded from the site is a directory of categories, where each subdirectory name is a class label and the files inside it are the documents belonging to that class (one file per article).

Mahout preprocesses the data with org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups: it walks every file under the original category directories in turn and uses each directory name as the category label, which is how the inputDir --> outputFile conversion is done. To handle HTML files, BayesFileFormatter would additionally have to clean the HTML and extract the body text to produce cleaned text.
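A minimal shell sketch of that transformation, assuming a crude lowercase/strip-punctuation tokenizer is good enough for illustration (the real tool tokenizes with the Lucene analyzer passed via -a, and its output layout may differ):

# walk every category directory; the directory name becomes the class label
for dir in /usr/local/20news-bydate/20news-bydate-train/*/; do
  label=$(basename "$dir")
  for f in "$dir"*; do
    # crude tokenization: lowercase, drop punctuation, collapse the document onto one line
    text=$(tr 'A-Z' 'a-z' < "$f" | tr -c 'a-z0-9' ' ' | tr -s ' ')
    printf '%s\t%s\n' "$label" "$text"   # label<TAB>terms, one line per document
  done
done > /tmp/train-prepared.txt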

2. Preprocess the test input data:

[root@mlj ~]# mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /usr/local/20news-bydate/20news-bydate-test -o /usr/local/20news-bydate/bayesoutput/test -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3. Upload the processed files to HDFS

[root@mlj bayesoutput]# hadoop fs -put train /naivebayes/
[root@mlj bayesoutput]# hadoop fs -put test /naivebayes/
[root@mlj bayesoutput]# hadoop fs -ls /naivebayes
Found 2 items
drwxr-xr-x - root supergroup 0 2015-04-23 01:12 /naivebayes/test
drwxr-xr-x - root supergroup 0 2015-04-23 01:12 /naivebayes/train

4. Train the classification model

[root@mlj bayesoutput]# mahout trainclassifier -i /naivebayes/train -o /naivebayes/model -type bayes -ng 3 -source hdfs
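Once the job finishes, the model files should appear under the output path; a quick sanity check:

hadoop fs -ls /naivebayes/model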

5. Test the trained model

 

[root@mlj bayesoutput]# mahout testclassifier -m /naivebayes/model -d /naivebayes/test/ -type bayes -ng 3 -source hdfs -method mapreduce

 

 

(Using a custom input format):

1. Split the dataset 1:4, with 1 part as the test set and 4 parts as the training set. In Pig this is done by taking a 20% random sample as the test set, then left-outer-joining the full set against that sample and keeping the unmatched rows as the training set (i.e. the complement):

 

grunt> processed = load 'hdfs://mlj:9000/out/processed' as (category:chararray,doc:chararray);
grunt> test = sample processed 0.2;
grunt> jnt = join processed by (category,doc) left outer, test by (category,doc);
grunt> filt_test = filter jnt by test::category is null;
grunt> train = foreach filt_test generate processed::category as category,processed::doc as doc;
grunt> store test into 'out/test';
grunt> store train into 'out/train';
grunt> train_ct = foreach (group train by category) generate group,COUNT(train.category);
grunt> test_ct = foreach (group test by category) generate group,COUNT(test.category);
grunt> dump train_ct;
grunt> dump test_ct;

2. Train the model:

mahout trainclassifier -i /nb/train -o /nb/model -type bayes -ng 1 -source hdfs

3. Test the model:

mahout testclassifier -d /nb/test -m /nb/model -type bayes -ng 1 -source hdfs -method mapreduce

----------------------------------------------------------------------------------------------------------

An alternative approach:

Serialize:  mahout seqdirectory -i /naive/news -o /naive/news-seq
Vectorize:  mahout seq2sparse -i /naive/news-seq -o /naive/news-vector -lnorm -nv -wt tfidf
Split:      mahout split -i /naive/news-vector/tfidf-vectors -tr /naive/news-train-vector -te /naive/news-test-vector -rp 20 -ow -seq -xm sequential
Train:      mahout trainnb -i /naive/news-train-vector -el -o /naive/news-model -li /naive/news-labindex -ow -c
Test:       mahout testnb -i /naive/news-test-vector -m /naive/news-model -l /naive/news-labindex -ow -o /naive/news-test -c
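To spot-check the intermediate SequenceFiles, for example the label index written by trainnb, Mahout's seqdumper utility can print their contents (paths as above; the exact output format varies by version):

mahout seqdumper -i /naive/news-labindex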
