Learning Mahout Text Clustering: the DictionaryVectorizer Class (1)
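After the text collection has been tokenized, each document has to be represented as a vector; the vector space model (VSM) is very commonly used in text clustering and classification. Vectorization means determining the dimensions that represent the document collection and the weight on each dimension. Mahout provides an implementation that computes the weights with tf-idf. The process has two main steps: first compute tf, then compute tf-idf.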
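To make the idea of a per-dimension weight concrete, here is a minimal sketch that uses the textbook definition tf-idf = tf * ln(N / df). This is an assumption made for illustration only: Mahout's own weighting formula may differ, and the names TfIdfSketch and tfidf are hypothetical.

```java
public class TfIdfSketch {
    // Textbook tf-idf: term frequency times the log of the inverse document frequency.
    // Assumption: this is the classic formula, not necessarily Mahout's exact weighting.
    static double tfidf(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term that occurs 3 times in a document and appears in 10 of 1000 documents.
        System.out.println(tfidf(3, 10, 1000));
        // A term that appears in every document gets weight 0 under this formula.
        System.out.println(tfidf(5, 1000, 1000));
    }
}
```

The second call shows why the idf factor matters: a term found in every document does not help tell documents apart, so its weight drops to zero however often it occurs.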
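DictionaryVectorizer takes the document collection tokenized by DocumentProcessor as input and converts it into tf vectors. Along the way it generates a dictionary that records the mapping from each term to an Integer id. Let's first look at the core source of the createTermFrequencyVectors function!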
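The term-to-id mapping and the resulting tf vector can be sketched in plain Java, without Hadoop. This is only an in-memory illustration of the idea, not Mahout's implementation; the class and method names (DictionarySketch, buildDictionary, tfVector) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DictionarySketch {
    // Assign each distinct term the next free integer id, in first-seen order.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dictionary.putIfAbsent(term, dictionary.size());
            }
        }
        return dictionary;
    }

    // Sparse tf vector: term id -> number of occurrences in this document.
    static Map<Integer, Integer> tfVector(List<String> doc, Map<String, Integer> dictionary) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String term : doc) {
            Integer id = dictionary.get(term);
            if (id != null) { // terms missing from the dictionary are skipped
                vector.merge(id, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("mahout", "text", "clustering", "text"),
            Arrays.asList("text", "vector"));
        Map<String, Integer> dictionary = buildDictionary(docs);
        System.out.println(dictionary);
        for (List<String> doc : docs) {
            System.out.println(tfVector(doc, dictionary));
        }
    }
}
```

The size of the dictionary is the dimensionality of the document vectors, which is the role maxTermDimension plays in the Mahout code.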
// Create the "wordcount" directory, which records the total count of every term
// in the document collection; this is done with a MapReduce job
Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);

// Records the total number of terms, i.e. the dimensionality of the document vectors
int[] maxTermDimension = new int[1];
List<Path> dictionaryChunks;
if (maxNGramSize == 1) {
  // Run the word count MapReduce job; this step produces sequence files of <Text word, long frequency>
  startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
  // This step produces several dictionary files, stored as <Text word, int id>
  dictionaryChunks = createDictionaryChunks(dictionaryJobPath, output, baseConf,
      chunkSizeInMegabytes, maxTermDimension);
} else {
  CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
      minSupport, minLLRValue, numReducers);
  dictionaryChunks = createDictionaryChunks(
      new Path(new Path(output, DICTIONARY_JOB_FOLDER), CollocDriver.NGRAM_OUTPUT_DIRECTORY),
      output, baseConf, chunkSizeInMegabytes, maxTermDimension);
}

int partialVectorIndex = 0;
Collection<Path> partialVectorPaths = Lists.newArrayList();
for (Path dictionaryChunk : dictionaryChunks) {
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  // This job is interesting: it runs once per dictionary chunk, each run writes
  // one part of the document vectors, and the parts are merged further down.
  // Only one dictionary chunk has to be held in memory at a time, so this is
  // presumably a memory and performance optimization.
  makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
      maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
}

Configuration conf = new Configuration(baseConf);
Path outputDir = new Path(output, tfVectorsFolderName);
// This step adds the partial vectors together and writes the result to the tf-vectors directory
PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, conf, normPower,
    logNormalize, maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
// Delete the temporary directories
HadoopUtil.delete(conf, partialVectorPaths);
Let's go through the core function above step by step. Because the input here is tokenized Chinese text, maxNGramSize is set to 1.
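The chunk-by-chunk loop followed by a merge can be pictured with a small in-memory sketch: each dictionary chunk yields a partial vector that covers only its own term ids, and the merge is an element-wise addition. This is an illustration under simplifying assumptions (no Hadoop, plain maps as sparse vectors), and the names PartialVectorSketch, partialVector and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartialVectorSketch {
    // Count only the terms that appear in this dictionary chunk.
    static Map<Integer, Integer> partialVector(List<String> doc, Map<String, Integer> chunk) {
        Map<Integer, Integer> partial = new TreeMap<>();
        for (String term : doc) {
            Integer id = chunk.get(term);
            if (id != null) {
                partial.merge(id, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Merge the partial vectors of one document by element-wise addition.
    static Map<Integer, Integer> merge(List<Map<Integer, Integer>> partials) {
        Map<Integer, Integer> merged = new TreeMap<>();
        for (Map<Integer, Integer> partial : partials) {
            partial.forEach((id, count) -> merged.merge(id, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two dictionary chunks that together cover term ids 0..3.
        Map<String, Integer> chunk0 = new TreeMap<>();
        chunk0.put("mahout", 0);
        chunk0.put("text", 1);
        Map<String, Integer> chunk1 = new TreeMap<>();
        chunk1.put("clustering", 2);
        chunk1.put("vector", 3);

        List<String> doc = Arrays.asList("mahout", "text", "clustering", "text");
        List<Map<Integer, Integer>> partials = new ArrayList<>();
        for (Map<String, Integer> chunk : Arrays.asList(chunk0, chunk1)) {
            partials.add(partialVector(doc, chunk)); // one pass per chunk
        }
        System.out.println(merge(partials));
    }
}
```

Because the chunks hold disjoint term ids, the addition never has two nonzero values on the same dimension; it simply stitches the parts of the vector back together.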
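First the wordcount directory is generated. The sequence files in this directory record the frequency of every term. The job's purpose is to drop the terms whose frequency is below our minSupport; its output is then consumed by createDictionaryChunks to build the dictionary. The work is done by a MapReduce job that involves three classes: TermCountMapper, TermCountCombiner and TermCountReducer. The combiner is an optimization on the local map output: it acts as a local reduce and cuts down the network bandwidth used. The three classes are analyzed below.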
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.math.function.ObjectLongProcedure;
import org.apache.mahout.math.map.OpenObjectLongHashMap;

public class TermCountMapper extends Mapper<Text, StringTuple, Text, LongWritable> {

  @Override
  protected void map(Text key, StringTuple value, final Context context)
      throws IOException, InterruptedException {
    // Term-to-count map for the document passed to this map() call
    OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
    for (String word : value.getEntries()) {
      if (wordCount.containsKey(word)) {
        wordCount.put(word, wordCount.get(word) + 1);
      } else {
        wordCount.put(word, 1);
      }
    }
    wordCount.forEachPair(new ObjectLongProcedure<String>() {
      @Override
      public boolean apply(String first, long second) {
        try {
          // Emit <term, count> as map output
          context.write(new Text(first), new LongWritable(second));
        } catch (IOException e) {
          context.getCounter("Exception", "Output IO Exception").increment(1);
        } catch (InterruptedException e) {
          context.getCounter("Exception", "Interrupted Exception").increment(1);
        }
        return true;
      }
    });
  }
}
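import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TermCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    // A plain reduce over the local map output. This is only a performance
    // optimization; note that minSupport is NOT applied here.
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }
}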
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.vectorizer.DictionaryVectorizer;

public class TermCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private int minSupport;

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    // Compute the final total frequency of the term
    for (LongWritable value : values) {
      sum += value.get();
    }
    // Drop terms whose support is below the given threshold
    if (sum >= minSupport) {
      context.write(key, new LongWritable(sum));
    }
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Read minSupport, used to drop low-frequency terms
    minSupport = context.getConfiguration().getInt(DictionaryVectorizer.MIN_SUPPORT,
        DictionaryVectorizer.DEFAULT_MIN_SUPPORT);
  }
}
Once this job completes, wordcount holds a record of every term that meets the given support threshold, together with its frequency.
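Why minSupport is applied only in the reducer and never in the combiner can be seen in a small Hadoop-free simulation. The sketch below is an illustration in plain Java rather than Mahout code; the names TermCountSketch, countSplit and reduce are hypothetical, and each "split" stands in for the data seen by one map task.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermCountSketch {
    // "Map" plus "combine": count the terms inside one split, with no filtering.
    static Map<String, Long> countSplit(List<List<String>> split) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : split) {
            for (String term : doc) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        return counts;
    }

    // "Reduce": sum the partial counts across splits, then apply minSupport.
    static Map<String, Long> reduce(List<Map<String, Long>> partials, int minSupport) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((term, count) -> totals.merge(term, count, Long::sum));
        }
        // Filtering is only safe here, where the global total is known.
        totals.values().removeIf(sum -> sum < minSupport);
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> split1 = Arrays.asList(Arrays.asList("text", "mahout"));
        List<List<String>> split2 = Arrays.asList(Arrays.asList("text", "vector", "vector"));
        List<Map<String, Long>> partials =
            Arrays.asList(countSplit(split1), countSplit(split2));
        System.out.println(reduce(partials, 2));
    }
}
```

In this example "text" occurs once in each split. Filtering at the combiner stage with a threshold of 2 would discard it in both splits, although its global count of 2 meets the threshold; summing first and filtering last keeps it.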
