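BOW User Guide
First, a note on pronunciation: "bow" rhymes with "low", not "cow".
bow consists of three programs: rainbow for text classification, arrow for text retrieval, and crossbow for text clustering. The three programs are independent of one another.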
Rainbow
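Before using rainbow you must first build a model of the source documents, which contains statistics about those documents. Every rainbow command takes the -d option to specify the path of the model.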
rainbow -d ~/model --index ~/20_newsgroups/*
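The command above builds a model covering every class in 20_newsgroups and writes it to ~/model.
The directories passed to --index can also be listed individually: rainbow -d ~/model --index ~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc
--index can be abbreviated to -i.
rainbow does not support documents with more than one class label.
The class that each document belongs to is already recorded in the model.
rainbow -d model --print-doc-names prints the names (with full paths) of all the files contained in the model.
By default, rainbow converts all letters to lowercase and removes stopwords when it builds the model.
Many other options can be given when building a model. For example, --skip-html skips all characters between "<" and ">", and --skip-headers (or -h) skips newsgroup or email headers before beginning tokenization.
Once the source documents have been indexed, you can start classifying.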
rainbow -d ~/model --test-set=0.4 --test=3
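This prints the results of 3 trials, each using 60% of the documents as the training set and the remaining 40% as the test set.
The output looks like this: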
/home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005
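Each line gives the path of a test document, its actual class, and then the probability that the document belongs to each class.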
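To work with this output in a script, a line can be split into its fields. The following is a minimal Python sketch, not part of bow; the function name parse_result_line is hypothetical, and the field layout (path, actual class, then class:score pairs) is simply read off the sample line above.

```python
def parse_result_line(line):
    """Split one classification line into (path, actual class, [(class, score), ...])."""
    fields = line.split()
    path, actual = fields[0], fields[1]
    scores = []
    for field in fields[2:]:
        # Split on the last colon only, in case a class name contains one.
        label, _, score = field.rpartition(":")
        scores.append((label, float(score)))
    return path, actual, scores


line = ("/home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc "
        "talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005")
path, actual, scores = parse_result_line(line)

# The predicted class is the one with the highest score.
predicted = max(scores, key=lambda pair: pair[1])[0]
print(predicted == actual)
```

For the sample line the predicted class and the actual class agree, so this document counts as correctly classified.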
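The bow directory also contains a Perl script, rainbow-stats. Its input is the output of the classification command above, and its output is the average accuracy, the standard error (printed as stderr), and a confusion matrix.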
rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats
This runs 2 trials. The output looks like this:
Trial 1
Correct: 1077 out of 1200 (89.75 percent accuracy)
- Confusion details, row is actual, column is predicted
classname 0 1 2 :total
0 talk.politics.guns 378 2 20 :400 94.50%
1 talk.politics.mideast 7 374 19 :400 93.50%
2 talk.politics.misc 57 18 325 :400 81.25%
Percent_Accuracy average 90.38 stderr 0.44
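The figures in the Trial 1 report can be checked directly from the confusion matrix: the correct count is the sum of the diagonal, and each row percentage is the diagonal entry divided by the row total. A small Python sketch of that arithmetic follows (the helper names are hypothetical; the numbers are copied from the table above).

```python
# Rows are actual classes, columns are predicted classes, in the same order.
confusion = {
    "talk.politics.guns":    [378, 2, 20],
    "talk.politics.mideast": [7, 374, 19],
    "talk.politics.misc":    [57, 18, 325],
}


def accuracy(confusion):
    """Return (correct, total, percent) for a square confusion matrix."""
    correct = sum(row[i] for i, row in enumerate(confusion.values()))
    total = sum(sum(row) for row in confusion.values())
    return correct, total, 100.0 * correct / total


def per_class_accuracy(confusion):
    """Percentage of each actual class that was predicted correctly."""
    return {name: 100.0 * row[i] / sum(row)
            for i, (name, row) in enumerate(confusion.items())}


correct, total, pct = accuracy(confusion)
print(f"Correct: {correct} out of {total} ({pct:.2f} percent accuracy)")
print(per_class_accuracy(confusion))
```

The first print reproduces the "Correct: 1077 out of 1200 (89.75 percent accuracy)" line of the report.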
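The --test-set option also accepts an integer. For example, --test-set=30 tries to pick 30 documents at random as the test set while spreading those 30 documents as evenly as possible across the classes.
--test-set=200pc puts 200 randomly selected documents from each class into the test set.
You can even specify exactly which files go into the test set: --test-set=file1 file2, with the files separated by spaces.
The training set can be specified in the same way: rainbow -d ~/model --test-set=file1 --train-set=file2 --test=1
A file cannot appear in both --test-set and --train-set.
By default, every document that is not in --test-set is in --train-set. You can also specify it the other way around: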
rainbow -d ~/model --train-set=1pc --test-set=remaining --test=1
If the test documents are not in the model, they can be given separately: rainbow -d ~/model --test-files ~/more-talk.politics/*
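To make the "Npc" and "remaining" arguments concrete, here is a rough Python sketch of a per-class split. It is only an illustration of the idea, not rainbow's actual selection code; the function split_per_class, the toy file names, and the seed handling are all assumptions.

```python
import random
from collections import defaultdict


def split_per_class(docs, n_per_class, seed):
    """docs is a list of (filename, label) pairs.

    Picks up to n_per_class documents at random from every class
    (the "Npc" idea) and returns (chosen, remaining).
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for name, label in docs:
        by_class[label].append(name)

    chosen = set()
    for label in sorted(by_class):
        names = by_class[label]
        chosen.update(rng.sample(names, min(n_per_class, len(names))))

    # Everything that was not chosen forms the "remaining" set.
    remaining = [name for name, _ in docs if name not in chosen]
    return sorted(chosen), remaining


# Toy corpus: 3 classes with 5 documents each (hypothetical file names).
docs = [(f"{label}/{i}", label)
        for label in ("guns", "mideast", "misc") for i in range(5)]
train, test = split_per_class(docs, 1, seed=2)
print(len(train), len(test))
```

With one document per class in the training set, the toy corpus of 15 documents splits into 3 training documents and 12 test documents, and no document lands in both sets.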
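Feature Selection
--prune-vocab-by-infogain=N (-T) computes the average mutual information of each word and keeps the N words with the highest mutual information as features. The default is N=0, which keeps every word.
--prune-vocab-by-doc-count=N (-D) keeps a word as a feature only if its document frequency (the number of documents it appears in) is greater than N.
--prune-vocab-by-occur-count=N (-O) keeps a word as a feature only if it occurs at least N times.
For example: rainbow -d ~/model --prune-vocab-by-infogain=50 --test=1
You can also find out which 50 words were selected as features with: rainbow -d ~/model -I 50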
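As an illustration of what selecting words by mutual information means, the sketch below scores each word by the mutual information between the class label and the presence or absence of that word in a document, then keeps the top N. This is the textbook definition applied to a toy corpus; rainbow's exact computation may differ, and the names info_gain and top_words are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def info_gain(docs, word):
    """Mutual information between the class and the presence of `word`."""
    class_counts = Counter(label for label, _ in docs)
    with_word = Counter(label for label, words in docs if word in words)
    without_word = class_counts - with_word  # keeps positive counts only
    gain = entropy(class_counts.values())
    for part in (with_word, without_word):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(part.values())
    return gain


def top_words(docs, n):
    """The n words with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*(words for _, words in docs))
    return sorted(vocab, key=lambda w: (-info_gain(docs, w), w))[:n]


# Toy corpus: (class, set of words in the document).
docs = [
    ("hockey", {"team", "game", "the"}),
    ("hockey", {"team", "puck", "the"}),
    ("space", {"orbit", "nasa", "the"}),
    ("space", {"orbit", "launch", "the"}),
]
print(top_words(docs, 2))
```

In the toy corpus "team" and "orbit" each separate the two classes perfectly (1 bit), while "the" appears everywhere and carries no information (0 bits), which is why pruning by information gain discards it.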
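To choose the classification method: rainbow -d ~/model --method=tfidf --test=1
rainbow supports the following classification methods: naivebayes, knn, tfidf, and prind (probabilistic indexing). The default is naivebayes.
The following options are available when classifying with naivebayes: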
--smoothing-method=METHOD Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell. The default is laplace, which is a uniform Dirichlet prior with alpha=2.
--event-model=EVENTNAME Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word (i.e. multinomial, unigram), document (i.e. multi-variate Bernoulli, bit vector), or document-then-word (i.e. document-length-normalized multinomial). For more details on these methods, see A Comparison of Event Models for Naive Bayes Text Classification. The default is word.
--uniform-class-priors When classifying and calculating mutual information, use equal prior probabilities on classes, instead of using the distribution determined from the training data.
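The remark that laplace corresponds to a uniform Dirichlet prior with alpha=2 can be made concrete: the MAP estimate under a symmetric Dirichlet(alpha) prior is (count + alpha - 1) / (total + V * (alpha - 1)), which for alpha=2 is the familiar add-one estimate. The Python sketch below shows that estimate for the word event model on made-up counts; it is not rainbow's code, and laplace_word_probs is a hypothetical name.

```python
def laplace_word_probs(word_counts, vocab):
    """Add-one (Laplace) estimate of P(word | class) over a fixed vocabulary.

    word_counts maps word -> number of occurrences in the class's training text.
    """
    total = sum(word_counts.get(w, 0) for w in vocab)
    return {w: (word_counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}


vocab = ["team", "game", "orbit"]
# Made-up counts for one class; "orbit" never occurs in it.
probs = laplace_word_probs({"team": 3, "game": 1}, vocab)
print(probs)
```

The unseen word "orbit" gets probability 1/7 instead of 0, so a single unseen word no longer forces the whole document probability to zero, and the estimates still sum to 1.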
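Diagnostics
You can inspect various kinds of information stored in the model.
rainbow -d ~/model -I 10 shows the mutual information (or document frequency, or occurrence count) of the top 10 feature words.
rainbow -d ~/model --train-set=~/docs1 -I 10 shows the mutual information (or document frequency, or occurrence count) of the top 10 feature words in ~/docs1.
rainbow -d ~/model -T 10 --print-word-probabilities=talk.politics.mideast shows the probabilities, within the class talk.politics.mideast, of the 10 feature words with the highest mutual information. The output looks like this: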
god 0.05026782
people 0.64977338
government 0.24062629
car 0.03502266
game 0.00412031
team 0.01030078
bike 0.00041203
dod 0.00041203
hockey 0.00123609
windows 0.00782859
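The probabilities sum to 1.
rainbow -d ~/model --print-word-counts=team shows how many times the word team occurs in each class and what fraction of that class's words it accounts for. The output looks like this: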
2 / 125039 ( 0.00002) alt.atheism
6 / 119511 ( 0.00005) comp.graphics
5 / 91147 ( 0.00005) comp.os.ms-windows.misc
1 / 71002 ( 0.00001) comp.sys.mac.hardware
12 / 131120 ( 0.00009) comp.windows.x
15 / 62130 ( 0.00024) misc.forsale
2 / 83942 ( 0.00002) rec.autos
10 / 78685 ( 0.00013) rec.motorcycles
543 / 88623 ( 0.00613) rec.sport.baseball
970 / 115109 ( 0.00843) rec.sport.hockey
9 / 136655 ( 0.00007) sci.crypt
1 / 81206 ( 0.00001) sci.electronics
8 / 125235 ( 0.00006) sci.med
71 / 128754 ( 0.00055) sci.space
2 / 141389 ( 0.00001) soc.religion.christian
13 / 135054 ( 0.00010) talk.politics.guns
24 / 208367 ( 0.00012) talk.politics.mideast
14 / 164266 ( 0.00009) talk.politics.misc
9 / 130013 ( 0.00007) talk.religion.misc
Note: the probability of the word team is not equal to the probability of team from the --print-word-probabilities command above, because we did not reduce vocabulary size to 10 in this example.
rainbow -d ~/model --train-set=3pc --print-doc-names=train
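With --print-doc-names=train you can see which documents are in the training set.
With --print-matrix you can also print the word-document matrix. Its argument combines one letter from each of the following three groups: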
Print entries for all words in the vocabulary, or just print the words that actually occur in the document.
a all
s sparse, (default)
Print word counts as integers or as binary presence/absence indicators.
b binary
i integer, (default)
How to indicate the word itself.
n integer word index
w word string
c combination of integer word index and word string, (default)
e empty, don't print anything to indicate the identity of the word
For example, the output of rainbow -d ~/model -T 100 --print-matrix=siw | head -n 10 looks like this:
~/20_newsgroups/alt.atheism/53366 alt.atheism god 2 jesus 1 nasa 2 people 2
~/20_newsgroups/alt.atheism/53367 alt.atheism jesus 2 jewish 1 christian 1
~/20_newsgroups/alt.atheism/51247 alt.atheism god 4 evidence 2
~/20_newsgroups/alt.atheism/51248 alt.atheism
~/20_newsgroups/alt.atheism/51249 alt.atheism nasa 1 country 2 files 1 law 3 system 1 government 1
~/20_newsgroups/alt.atheism/51250 alt.atheism god 3 people 2 evidence 1 law 1 system 1 public 5 rights 1 fact 1 religious 1
~/20_newsgroups/alt.atheism/51251 alt.atheism
~/20_newsgroups/alt.atheism/51252 alt.atheism people 4 evidence 2 system 2 religion 1
~/20_newsgroups/alt.atheism/51253 alt.atheism god 19 christian 1 evidence 1 faith 5 car 2 space 1 game 1
~/20_newsgroups/alt.atheism/51254 alt.atheism people 1 jewish 3 game 1 bible 7
whereas the output of rainbow -d ~/model -T 10 --print-matrix=abe | head -n 10 looks like this:
~/20_newsgroups/alt.atheism/53366 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/53367 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51247 alt.atheism 1 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51248 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51249 alt.atheism 0 0 1 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51250 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51251 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51252 alt.atheism 0 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51253 alt.atheism 1 0 0 1 1 0 0 0 0 0
~/20_newsgroups/alt.atheism/51254 alt.atheism 0 1 0 0 1 0 0 0 0 0
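The sparse siw format is easy to load into other tools. The Python sketch below parses one such row into a dictionary of word counts; the layout (file name, class, then alternating word and count) is read off the sample output above, and parse_sparse_row is a hypothetical helper, not something shipped with bow.

```python
def parse_sparse_row(line):
    """Parse one row of --print-matrix=siw output.

    Returns (filename, class label, {word: count}).
    """
    fields = line.split()
    filename, label = fields[0], fields[1]
    rest = fields[2:]
    # After the first two fields the row alternates: word, count, word, count, ...
    counts = {word: int(count) for word, count in zip(rest[0::2], rest[1::2])}
    return filename, label, counts


row = "~/20_newsgroups/alt.atheism/53366 alt.atheism god 2 jesus 1 nasa 2 people 2"
print(parse_sparse_row(row))

# A document that contains none of the selected words yields an empty dict.
empty = parse_sparse_row("~/20_newsgroups/alt.atheism/51248 alt.atheism")
print(empty[2])
```

Rows such as 51248 and 51251, which contain none of the 100 selected words, simply produce an empty dictionary.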
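General Options
--verbosity=LEVEL (LEVEL is one of 0, 1, 2, 3, 4, 5) controls how much progress information is printed. The smaller the number, the less is printed. The default is 2.
For example: rainbow -v 0 -d ~/model -I 10
Any command that involves random selection can be given a seed: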
rainbow -d ~/model -t 1 --test-set=0.3 --random-seed=2
rainbow -d ~/model --random-seed=1 --train-set=4pc --print-doc-names=train
Arrow
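Arrow is a separate program that uses TFIDF to retrieve documents.
To index the documents: arrow --index 20_newsgroups/talk.politics.*
I do not yet know where the files generated by indexing are stored.
Querying: orisun@zcypc:~/master$ arrow --query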
Loading data files...
Type your query text now. End with a Control-D.
america,HITCOUNT 176
20_newsgroups/talk.politics.mideast/76225 0.104700 america
20_newsgroups/talk.politics.guns/53295 0.094274 america
20_newsgroups/talk.politics.misc/178545 0.089230 america
20_newsgroups/talk.politics.misc/179112 0.088966 america
20_newsgroups/talk.politics.guns/55265 0.087667 america
20_newsgroups/talk.politics.misc/178465 0.086633 america
20_newsgroups/talk.politics.mideast/77184 0.084780 america
20_newsgroups/talk.politics.guns/54318 0.084553 america
20_newsgroups/talk.politics.misc/178790 0.081991 america
20_newsgroups/talk.politics.mideast/76286 0.081669 america
.
orisun@zcypc:~/master$ arrow --query
Loading data files...
Type your query text now. End with a Control-D.
china,HITCOUNT 30
20_newsgroups/talk.politics.mideast/76476 0.176953 china
20_newsgroups/talk.politics.mideast/76536 0.152673 china
20_newsgroups/talk.politics.guns/53341 0.151225 china
20_newsgroups/talk.politics.mideast/75405 0.151225 china
20_newsgroups/talk.politics.misc/178554 0.127059 china
20_newsgroups/talk.politics.misc/176864 0.116911 china
20_newsgroups/talk.politics.misc/178488 0.105641 china
20_newsgroups/talk.politics.misc/178326 0.104150 china
20_newsgroups/talk.politics.misc/176852 0.092056 china
20_newsgroups/talk.politics.mideast/76264 0.090448 china
.
orisun@zcypc:~/master$ arrow --query
Loading data files...
Type your query text now. End with a Control-D.
AMERICA
CHINA
,HITCOUNT 194
20_newsgroups/talk.politics.mideast/76476 0.150675 china
20_newsgroups/talk.politics.mideast/76536 0.130001 china
20_newsgroups/talk.politics.guns/53341 0.128768 china
20_newsgroups/talk.politics.mideast/75405 0.128768 china
20_newsgroups/talk.politics.misc/178554 0.108191 china
20_newsgroups/talk.politics.misc/176864 0.099550 china
20_newsgroups/talk.politics.guns/54857 0.091291 china america
20_newsgroups/talk.politics.misc/178488 0.089953 china
20_newsgroups/talk.politics.misc/178326 0.088683 china
20_newsgroups/talk.politics.misc/176852 0.078386 china
.
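The transcripts above show the general shape of TFIDF retrieval: every document containing at least one query word is a hit, hits are ranked by a score, and the matching words are listed next to each hit. The Python sketch below imitates that behaviour on a toy corpus. The weighting used here (term frequency divided by document length, times log inverse document frequency) is an assumption for illustration only and is not claimed to be arrow's formula; build_index and query are hypothetical names. Queries are lowercased because the transcript shows AMERICA and CHINA matching america and china.

```python
import math
from collections import Counter


def build_index(docs):
    """docs maps name -> text. Returns (per-document term counts, document frequencies)."""
    term_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    return term_counts, doc_freq


def query(term_counts, doc_freq, text):
    """Return hits as (name, score, matched words), best score first."""
    terms = text.lower().split()
    n_docs = len(term_counts)
    hits = []
    for name, counts in term_counts.items():
        matched = [t for t in terms if counts[t] > 0]
        if not matched:
            continue  # documents with no query word are not hits
        length = sum(counts.values())
        score = sum(counts[t] / length * math.log(n_docs / doc_freq[t])
                    for t in matched)
        hits.append((name, score, matched))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy corpus with hypothetical document names.
docs = {
    "a": "china trade china policy",
    "b": "america policy debate today",
    "c": "hockey game tonight",
}
term_counts, doc_freq = build_index(docs)
hits = query(term_counts, doc_freq, "AMERICA CHINA")
print("HITCOUNT", len(hits))
for name, score, matched in hits:
    print(name, f"{score:.6f}", " ".join(matched))
```

Document "a" ranks first because "china" makes up a larger share of its words, and document "c" is not a hit at all because it contains neither query word.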
This article is from 博客园 (cnblogs), written by 张朝阳. When reposting, please cite the original link: https://www.cnblogs.com/zhangchaoyang/articles/2193557.html