Learning Mahout Text Clustering: The DictionaryVectorizer Class (2)

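Next, let's look at the createDictionaryChunks() method. It builds a dictionary that maps each term to a term id and, along the way, counts the number of dimensions of the document vectors. Because the dictionary can be large, the method splits it into chunks: each chunk is sized to fit in memory, and once a chunk is full it is written to disk and a new one is started, which keeps processing efficient.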

  private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                   Path dictionaryPathBase,
                                                   Configuration baseConf,
                                                   int chunkSizeInMegabytes,
                                                   int[] maxTermDimension) throws IOException {
    List<Path> chunkPaths = Lists.newArrayList();
    
    Configuration conf = new Configuration(baseConf);
    
    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
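    // Upper bound, in bytes, on the in-memory size of one chunk; chunkSizeInMegabytes is typically chosen as
    // the free memory on a node divided by its number of CPU cores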
    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);
    
    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
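        // Once the current chunk has reached its in-memory size limit, close this writer so the data is
        // written to disk; each chunk file is named DICTIONARY_FILE plus the chunk index (dictionary.file-n)
        if (currentChunkSize > chunkSizeLimit) {
          // Close the writer, flushing the current chunk to disk
          Closeables.closeQuietly(dictWriter);
          // Move on to the next chunk index
          chunkIndex++;
          // Build the storage path for the new chunk
          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);
          // Create a new writer for the new chunk
          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          // Reset the chunk size counter to 0
          currentChunkSize = 0;
        }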

        Writable key = record.getFirst();
        // Estimated in-memory size of one record: fixed overhead + 2 bytes per character + 4 bytes for the int id
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        dictWriter.append(key, new IntWritable(i++));
      }
      // After every term in the word-count files has been numbered, i is the dimension of the document vectors
      maxTermDimension[0] = i;
    } finally {
      Closeables.closeQuietly(dictWriter);
    }
    
    return chunkPaths;
  }
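
The rollover logic above can be tried out without a Hadoop cluster. The following is only a minimal sketch using plain JDK collections in place of SequenceFile.Writer; the class name DictionaryChunkSketch, the helper names, and the overhead value of 4 bytes are assumptions made for illustration, not taken from the Mahout source shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the chunking loop; in-memory maps replace the SequenceFile writers.
final class DictionaryChunkSketch {
  // Assumed value, for illustration only; check DictionaryVectorizer for the real constant.
  static final int DICTIONARY_BYTE_OVERHEAD = 4;

  // Same estimate as in the method above: overhead + 2 bytes per char + 4 bytes for the int id.
  static int estimateEntrySize(String term) {
    return DICTIONARY_BYTE_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
  }

  // Assigns consecutive ids to the (assumed distinct) terms and starts a new chunk
  // whenever the running size estimate has exceeded chunkSizeLimit.
  static List<Map<String, Integer>> chunk(List<String> terms, long chunkSizeLimit) {
    List<Map<String, Integer>> chunks = new ArrayList<>();
    Map<String, Integer> current = new LinkedHashMap<>();
    chunks.add(current);
    long currentChunkSize = 0;
    int nextId = 0;
    for (String term : terms) {
      if (currentChunkSize > chunkSizeLimit) {
        current = new LinkedHashMap<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      currentChunkSize += estimateEntrySize(term);
      current.put(term, nextId++);
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("apple", "banana", "cherry", "date", "elder");
    List<Map<String, Integer>> chunks = chunk(terms, 30);
    for (int c = 0; c < chunks.size(); c++) {
      System.out.println("chunk " + c + ": " + chunks.get(c));
    }
    // prints:
    // chunk 0: {apple=0, banana=1}
    // chunk 1: {cherry=2, date=3}
    // chunk 2: {elder=4}
  }
}
```

Note that the limit is checked before a record is appended, so a chunk can overshoot the limit by up to one record. The ids keep counting across chunks, which is why the final counter equals the vector dimension.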

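At this point we have both the dimension of the vectors and the dictionary that maps each term to its id. The dictionary is stored as separate chunks, and each SequenceFile holds records of the form <Text word, IntWritable id>.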
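
To make the role of the term-to-id mapping concrete, here is a rough sketch of how such a dictionary can turn a tokenized document into a sparse term-frequency vector. It uses a plain java.util.Map rather than the SequenceFile chunks, and the class DictionaryLookupSketch and its method are hypothetical names chosen for this example; it is not the code Mahout itself runs in the later vectorization step.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: term id -> term count, with ids taken from a dictionary.
final class DictionaryLookupSketch {
  // Builds a sparse term-frequency vector; tokens missing from the dictionary are skipped.
  static Map<Integer, Integer> toSparseTf(List<String> tokens, Map<String, Integer> dictionary) {
    Map<Integer, Integer> tf = new TreeMap<>();
    for (String token : tokens) {
      Integer id = dictionary.get(token);
      if (id == null) {
        continue; // term was never assigned an id
      }
      tf.merge(id, 1, Integer::sum);
    }
    return tf;
  }

  public static void main(String[] args) {
    Map<String, Integer> dictionary = new HashMap<>();
    dictionary.put("apple", 0);
    dictionary.put("banana", 1);
    dictionary.put("cherry", 2);

    List<String> tokens = Arrays.asList("banana", "apple", "banana", "kiwi");
    System.out.println(toSparseTf(tokens, dictionary)); // prints {0=1, 1=2}
    System.out.println("dimension = " + dictionary.size()); // prints dimension = 3
  }
}
```

Every id lies in the range 0 to dimension - 1, so the dimension computed while building the dictionary is enough to size each document vector.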

posted @ 2012-09-27 18:50 answer0107
