Mahout Text Clustering Notes: The DictionaryVectorizer Class (1)

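Once the text collection has been tokenized, each document has to be represented as a vector; the vector space model (VSM) is very commonly used in text clustering and classification. Vectorization comes down to deciding which dimensions represent the document collection and what weight each dimension carries. Mahout provides an implementation that computes the weights with tf-idf, in two main steps: first tf, then tf-idf.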
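
To make the two steps concrete, here is a minimal sketch in plain Java (no Hadoop or Mahout involved). It uses the textbook weight tf * ln(N / df), which is an assumption on my part; the weighting Mahout actually applies may differ in its details. The class name TfIdfSketch and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Mahout.
class TfIdfSketch {

  /** Step 1: raw term frequency of each term in one tokenized document. */
  static Map<String, Integer> termFrequencies(List<String> tokens) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : tokens) {
      tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }

  /** Number of documents in which each term occurs at least once. */
  static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      // De-duplicate first so a term counts once per document.
      for (String term : new HashSet<>(doc)) {
        df.merge(term, 1, Integer::sum);
      }
    }
    return df;
  }

  /** Step 2: textbook tf-idf weight, tf * ln(numDocs / df). Assumed formula. */
  static double tfIdf(int tf, int df, int numDocs) {
    return tf * Math.log((double) numDocs / df);
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("mahout", "clustering", "mahout"),
        Arrays.asList("text", "clustering"));
    Map<String, Integer> df = documentFrequencies(docs);
    for (List<String> doc : docs) {
      for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
        double weight = tfIdf(e.getValue(), df.get(e.getKey()), docs.size());
        System.out.println(e.getKey() + " -> " + weight);
      }
    }
  }
}
```

A term that appears in every document gets weight 0 under this formula, which is the usual motivation for the idf factor.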

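DictionaryVectorizer takes the document collection tokenized by DocumentProcessor as input and turns it into tf vectors. Along the way it generates a dictionary that records the mapping from each term to an Integer id. Let's look at the core source of the createTermFrequencyVectors function first!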
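
Before reading the real source, it may help to see the idea of "dictionary plus tf vector" in miniature. The sketch below is plain Java rather than Mahout code: the class DictionarySketch is hypothetical, and assigning ids in first-seen order is an assumption made for simplicity, not a statement about how Mahout orders its ids.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Mahout.
class DictionarySketch {

  /** Gives every distinct term the next free integer id (first-seen order, an assumption). */
  static Map<String, Integer> buildDictionary(List<List<String>> docs) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (List<String> doc : docs) {
      for (String term : doc) {
        if (!dictionary.containsKey(term)) {
          dictionary.put(term, dictionary.size());
        }
      }
    }
    return dictionary;
  }

  /** Sparse tf vector: term id -> count. Terms missing from the dictionary are skipped. */
  static Map<Integer, Double> toTfVector(List<String> doc, Map<String, Integer> dictionary) {
    Map<Integer, Double> vector = new TreeMap<>();
    for (String term : doc) {
      Integer id = dictionary.get(term);
      if (id != null) {
        vector.merge(id, 1.0, Double::sum);
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("mahout", "clustering", "mahout"),
        Arrays.asList("text", "clustering"));
    Map<String, Integer> dictionary = buildDictionary(docs);
    System.out.println("dictionary: " + dictionary);
    for (List<String> doc : docs) {
      System.out.println(doc + " -> " + toTfVector(doc, dictionary));
    }
  }
}
```

The dictionary size is the dimensionality of every document vector, which is the role maxTermDimension plays in the Mahout code that follows.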

    // Create the "wordcount" directory, which records the total count of every term across the document collection; done with a MapReduce job
    Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);
    // Holds the total number of terms, i.e. the dimensionality of the document vectors
    int[] maxTermDimension = new int[1];
    List<Path> dictionaryChunks;
    if (maxNGramSize == 1) {
      // Run the word count MapReduce job; this step produces sequence files of <Text word, LongWritable frequency>
      startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
      // This step produces several dictionary files, stored as <Text word, IntWritable id>
      dictionaryChunks =
          createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);
    } else {
      CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
        minSupport, minLLRValue, numReducers);
      dictionaryChunks =
          createDictionaryChunks(new Path(new Path(output, DICTIONARY_JOB_FOLDER),
                                          CollocDriver.NGRAM_OUTPUT_DIRECTORY),
                                 output,
                                 baseConf,
                                 chunkSizeInMegabytes,
                                 maxTermDimension);
    }
    
    int partialVectorIndex = 0;
    Collection<Path> partialVectorPaths = Lists.newArrayList();
    for (Path dictionaryChunk : dictionaryChunks) {
      Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
      partialVectorPaths.add(partialVectorOutputPath);
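      // This job is interesting: it runs once per dictionary chunk, each time writing one part of the document vectors; the parts are merged further down.
      // Presumably this keeps memory use in check (only one dictionary chunk has to be held at a time), so it is a performance optimization.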
      makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
        maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
    }
    
    Configuration conf = new Configuration(baseConf);

    Path outputDir = new Path(output, tfVectorsFolderName);
    // This step adds the partial vectors together and writes the result to the tf-vectors directory
    PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, conf, normPower, logNormalize,
      maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
    // Delete the temporary directories
    HadoopUtil.delete(conf, partialVectorPaths);

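Let's walk through the core function above step by step. Because the input is Chinese text that has already been segmented into words, maxNGramSize is set to 1.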
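
The loop over dictionary chunks followed by the merge can be pictured with a small plain-Java sketch. It is not the Mahout implementation: the class PartialVectorSketch and both method names are hypothetical, and the chunks are ordinary in-memory maps instead of sequence files. The point is only that each pass sees one slice of the dictionary, and that adding the partial vectors afterwards reconstructs the full tf vector because the slices cover disjoint ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Mahout.
class PartialVectorSketch {

  /** Builds the part of a document's tf vector that one dictionary chunk covers. */
  static Map<Integer, Double> makePartialVector(List<String> doc, Map<String, Integer> dictionaryChunk) {
    Map<Integer, Double> partial = new TreeMap<>();
    for (String term : doc) {
      Integer id = dictionaryChunk.get(term);
      if (id != null) {
        partial.merge(id, 1.0, Double::sum);
      }
    }
    return partial;
  }

  /** Merges partial vectors by element-wise addition. */
  static Map<Integer, Double> mergePartialVectors(List<Map<Integer, Double>> partials) {
    Map<Integer, Double> merged = new TreeMap<>();
    for (Map<Integer, Double> partial : partials) {
      for (Map.Entry<Integer, Double> e : partial.entrySet()) {
        merged.merge(e.getKey(), e.getValue(), Double::sum);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Two dictionary chunks with disjoint id ranges.
    Map<String, Integer> chunk0 = new HashMap<>();
    chunk0.put("mahout", 0);
    chunk0.put("clustering", 1);
    Map<String, Integer> chunk1 = new HashMap<>();
    chunk1.put("text", 2);

    List<String> doc = Arrays.asList("mahout", "text", "mahout");
    List<Map<Integer, Double>> partials = new ArrayList<>();
    for (Map<String, Integer> chunk : Arrays.asList(chunk0, chunk1)) {
      partials.add(makePartialVector(doc, chunk));
    }
    System.out.println(mergePartialVectors(partials));
  }
}
```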

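First the wordcount directory is generated. The sequence files in it record the frequency of every term; the job's purpose is to drop terms whose frequency is below our minSupport, and its output is used again later when computing idf. The work is done by a MapReduce job that mainly involves three classes: TermCountMapper, TermCountCombiner, and TermCountReducer, analyzed below. The combiner optimizes the local map output: it acts as a local reduce, which cuts network bandwidth usage.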

public class TermCountMapper extends Mapper<Text, StringTuple, Text, LongWritable> {

  @Override
  protected void map(Text key, StringTuple value, final Context context) throws IOException, InterruptedException {
    // Term -> count for the document handled by this map() call
    OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
    
    for (String word : value.getEntries()) {
      if (wordCount.containsKey(word)) {
        wordCount.put(word, wordCount.get(word) + 1);
      } else {
        wordCount.put(word, 1);
      }
    }
    wordCount.forEachPair(new ObjectLongProcedure<String>() {
      @Override
      public boolean apply(String first, long second) {
        try {
          // Emit as map output (written locally)
          context.write(new Text(first), new LongWritable(second));
        } catch (IOException e) {
          context.getCounter("Exception", "Output IO Exception").increment(1);
        } catch (InterruptedException e) {
          context.getCounter("Exception", "Interrupted Exception").increment(1);
        }
        return true;
      }
    });
  }
}

public class TermCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
    long sum = 0;
    // Simply reduces the output of the local map task; purely a performance optimization. Note that minSupport is NOT applied here
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }

}

public class TermCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private int minSupport;

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
    long sum = 0;
    // Compute the final total frequency
    for (LongWritable value : values) {
      sum += value.get();
    }
    // Drop terms whose support is below the given threshold
    if (sum >= minSupport) {
      context.write(key, new LongWritable(sum));
    }
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Read minSupport, used to drop low-frequency terms
    minSupport = context.getConfiguration().getInt(DictionaryVectorizer.MIN_SUPPORT,
                                                   DictionaryVectorizer.DEFAULT_MIN_SUPPORT);
  }

}

Once this job finishes, wordcount holds a term/frequency record for every term that meets the given support.
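
To see the whole word count flow without a Hadoop cluster, here is a minimal single-process sketch. It is an approximation, not the Mahout job: the class TermCountSketch and its method names are hypothetical, the per-document counting stands in for the mapper, and the summing stands in for both the combiner and the reducer. As in the classes above, the minSupport threshold is applied only to the final totals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Mahout.
class TermCountSketch {

  /** "Map" side: count the terms of a single tokenized document. */
  static Map<String, Long> countDocument(List<String> tokens) {
    Map<String, Long> counts = new HashMap<>();
    for (String token : tokens) {
      counts.merge(token, 1L, Long::sum);
    }
    return counts;
  }

  /** "Reduce" side: sum the per-document counts, then keep terms with total >= minSupport. */
  static Map<String, Long> sumAndFilter(List<Map<String, Long>> perDocumentCounts, long minSupport) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> counts : perDocumentCounts) {
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        totals.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    // The threshold is applied to the global total only, never to partial sums.
    totals.values().removeIf(total -> total < minSupport);
    return totals;
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("mahout", "clustering", "mahout"),
        Arrays.asList("text", "clustering"));
    List<Map<String, Long>> mapped = new ArrayList<>();
    for (List<String> doc : docs) {
      mapped.add(countDocument(doc));
    }
    System.out.println(sumAndFilter(mapped, 2));
  }
}
```

Filtering inside the combiner would be wrong for the same reason it would be wrong on partial sums here: a term can fall below the threshold locally and still reach it globally.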

posted @ 2012-09-27 17:51 answer0107
