Learning Mahout Text Clustering: The DictionaryVectorizer Class (2)

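  Next, let's look at the createDictionaryChunks() method. Its job is to build a dictionary that maps each word to a word id, and to compute the dimension of the vectors that will be generated for the documents. The method splits a large dictionary into chunks: each chunk is sized to fit in memory and is written to disk once it is full, which makes processing more efficient.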

  private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                   Path dictionaryPathBase,
                                                   Configuration baseConf,
                                                   int chunkSizeInMegabytes,
                                                   int[] maxTermDimension) throws IOException {
    List<Path> chunkPaths = Lists.newArrayList();
    
    Configuration conf = new Configuration(baseConf);
    
    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
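    // Size limit of one in-memory chunk, in bytes. chunkSizeInMegabytes is typically chosen as
    // roughly (free memory on the node) / (number of CPU cores).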
    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);
    
    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
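        // Once the current chunk exceeds the size limit, close its writer so the data is written
        // to disk, then start a new chunk file named dictionary.file-n
        if (currentChunkSize > chunkSizeLimit) {
          // Close the writer so the current chunk is flushed to disk
          Closeables.closeQuietly(dictWriter);
          // Move on to the next chunk index
          chunkIndex++;
          // Build the storage path for the new chunk
          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);
          // Create a new writer for the new chunk
          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          // Reset the chunk size counter to 0
          currentChunkSize = 0;
        }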

        Writable key = record.getFirst();
        // Estimated in-memory size of one record, in bytes
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        dictWriter.append(key, new IntWritable(i++));
      }
      // After counting every word in the word-count files, i is the dimension of the document vectors
      maxTermDimension[0] = i;
    } finally {
      Closeables.closeQuietly(dictWriter);
    }
    
    return chunkPaths;
  }
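
To make the rollover rule easier to follow, here is a minimal sketch of the same bookkeeping without any Hadoop classes. The class name ChunkSketch, the method assignChunks, and the 4-byte overhead are assumptions made for illustration; the real DICTIONARY_BYTE_OVERHEAD constant is defined in Mahout's DictionaryVectorizer and its value is not shown in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkSketch {
  // Assumed per-record overhead in bytes; a stand-in for DICTIONARY_BYTE_OVERHEAD.
  static final int ASSUMED_OVERHEAD = 4;

  // Same estimate as in the listing: overhead + 2 bytes per char + 4 bytes for the int id.
  static int fieldSize(String word) {
    return ASSUMED_OVERHEAD + word.length() * 2 + Integer.SIZE / 8;
  }

  // Returns, for each word in input order, the index of the chunk it would be written to.
  static List<Integer> assignChunks(List<String> words, long chunkSizeLimit) {
    List<Integer> chunkOfWord = new ArrayList<>();
    int chunkIndex = 0;
    long currentChunkSize = 0;
    for (String word : words) {
      // The check runs before the append, so a chunk only rolls over
      // after its estimated size has already passed the limit.
      if (currentChunkSize > chunkSizeLimit) {
        chunkIndex++;
        currentChunkSize = 0;
      }
      currentChunkSize += fieldSize(word);
      chunkOfWord.add(chunkIndex);
    }
    return chunkOfWord;
  }

  public static void main(String[] args) {
    // Estimated sizes are 10, 12, 14 and 16 bytes; with a 20-byte limit
    // the first two words land in chunk 0 and the last two in chunk 1.
    System.out.println(assignChunks(Arrays.asList("a", "bb", "ccc", "dddd"), 20L));
  }
}
```

With these toy inputs the program prints [0, 0, 1, 1]. Note that chunk 0 ends up at 22 estimated bytes, slightly over the limit, because the size is tested before the next record is appended rather than after.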

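At this point we have both the vector dimension and the dictionary that maps words to ids. The dictionary is stored across separate chunks, and each SequenceFile record has the form <Text word, IntWritable id>.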
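
As a rough illustration of what the word-to-id mapping and the dimension are used for, the sketch below replaces the SequenceFile chunks with an in-memory map. The class DictionarySketch and its methods are hypothetical names, not part of Mahout, and the term-frequency step is only one plausible use of the dictionary.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionarySketch {
  // Assigns consecutive ids 0..n-1 in iteration order. For distinct input words
  // this matches what the i++ counter does in createDictionaryChunks().
  static Map<String, Integer> buildDictionary(List<String> words) {
    Map<String, Integer> dict = new LinkedHashMap<>();
    for (String w : words) {
      dict.putIfAbsent(w, dict.size());
    }
    return dict;
  }

  // Builds a term-frequency vector whose length equals the dictionary size
  // (the role played by maxTermDimension[0]). Unknown tokens are skipped.
  static int[] termFrequencies(List<String> tokens, Map<String, Integer> dict) {
    int[] vector = new int[dict.size()];
    for (String t : tokens) {
      Integer id = dict.get(t);
      if (id != null) {
        vector[id]++;
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    Map<String, Integer> dict = buildDictionary(Arrays.asList("apple", "banana", "cherry"));
    int[] tf = termFrequencies(Arrays.asList("banana", "apple", "banana", "durian"), dict);
    System.out.println(dict + " -> " + Arrays.toString(tf));
  }
}
```

Here the dictionary has three entries, so every document vector has dimension 3, and a token such as "durian" that is missing from the dictionary contributes nothing.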

posted @ 2012-09-27 18:50  answer0107