Learning Mahout Text Clustering: The DictionaryVectorizer Class (2)
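Next, let's look at the createDictionaryChunks() method. Its job is to build a dictionary that maps each term to a term id, and to count the number of dimensions of the vectors that will be generated for the documents. The method processes large data in chunks: each chunk is sized so that it fits in memory, and once a chunk reaches its size limit its writer is closed, the data is flushed to disk, and a new chunk file is started. This improves processing efficiency.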
private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                 Path dictionaryPathBase,
                                                 Configuration baseConf,
                                                 int chunkSizeInMegabytes,
                                                 int[] maxTermDimension) throws IOException {
  List<Path> chunkPaths = Lists.newArrayList();

  Configuration conf = new Configuration(baseConf);

  FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

  // Size limit of one in-memory chunk, in bytes. A suitable value for
  // chunkSizeInMegabytes is the free memory on the node divided by the number of CPU cores.
  long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
  int chunkIndex = 0;
  Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
  chunkPaths.add(chunkPath);

  SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

  try {
    long currentChunkSize = 0;
    Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
    int i = 0;
    for (Pair<Writable,Writable> record
         : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
      // Once the current chunk exceeds its size limit, close this writer so the data
      // is written to disk. The files are named *.dictionary.file-n.
      if (currentChunkSize > chunkSizeLimit) {
        // Close the writer, flushing the in-memory data to disk
        Closeables.closeQuietly(dictWriter);
        // Move on to the next chunk index
        chunkIndex++;
        // Build the path for the new chunk
        chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
        chunkPaths.add(chunkPath);
        // Create a new writer for it
        dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
        // Reset the chunk size to 0
        currentChunkSize = 0;
      }

      Writable key = record.getFirst();
      // Estimated memory footprint of one record
      int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
      currentChunkSize += fieldSize;
      dictWriter.append(key, new IntWritable(i++));
    }
    // After counting every term in the word count files, i is the dimensionality of the document vectors
    maxTermDimension[0] = i;
  } finally {
    Closeables.closeQuietly(dictWriter);
  }

  return chunkPaths;
}
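At this point we have both the dimensionality of the vectors and the dictionary that maps terms to ids. The dictionary is stored across separate chunks, and each chunk is a SequenceFile whose records have the form <Text word, IntWritable id>.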
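The chunk rollover logic can be tried out without a Hadoop cluster. What follows is a minimal sketch, not Mahout's code: the SequenceFile writers are replaced by in-memory maps, and the class name DictionaryChunkSketch, the chunk() and fieldSize() helpers, and the BYTE_OVERHEAD value of 4 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryChunkSketch {
    // Assumed per-record overhead in bytes (a stand-in for DICTIONARY_BYTE_OVERHEAD).
    static final int BYTE_OVERHEAD = 4;

    // Estimated size of one <term, id> record: overhead + 2 bytes per char + 4 bytes for the int id.
    static int fieldSize(String term) {
        return BYTE_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
    }

    // Splits the terms into chunks. Ids are assigned globally in input order,
    // so they keep increasing across chunk boundaries.
    static List<Map<String, Integer>> chunk(List<String> terms, long chunkSizeLimit) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        chunks.add(current);
        long currentChunkSize = 0;
        int nextId = 0;
        for (String term : terms) {
            // The check runs before the append, so a chunk may overshoot the limit by one record.
            if (currentChunkSize > chunkSizeLimit) {
                current = new LinkedHashMap<>();
                chunks.add(current);
                currentChunkSize = 0;
            }
            currentChunkSize += fieldSize(term);
            current.put(term, nextId++);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "banana", "cherry", "date", "elder");
        List<Map<String, Integer>> chunks = chunk(terms, 30);
        for (int n = 0; n < chunks.size(); n++) {
            System.out.println("dictionary.file-" + n + " -> " + chunks.get(n));
        }
    }
}
```

With a limit of 30 bytes, the five sample terms land in three chunks: {apple=0, banana=1}, {cherry=2, date=3} and {elder=4}. The ids run from 0 to 4 without restarting, so the total number of terms, 5, is the value that the original method stores in maxTermDimension[0].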
