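Learning Mahout Text Clustering: The TFIDFConverter Class (1)
This class takes the tf-vectors generated by the DictionaryVectorizer class as input. It runs several MapReduce jobs to count the number of documents and the document frequency (df) of each term (a term counts once for every document it appears in, no matter how many times it occurs there), then computes the term frequency-inverse document frequency (TF-IDF) weights and stores them as SequenceFiles in the tfidf-vectors directory.
These steps are carried out by calling two functions: calculateDF(), which obtains the document frequency of each term, and processTfIdf().
In calculateDF(), startDFCounting() first runs a MapReduce job that obtains the document frequency of each term together with the total number of documents, stored as a sequence file of <int wordid, int count> pairs. createDictionaryChunks() then converts those statistics into files named frequency.file-n and derives the number of terms and the number of documents.
Let's look at what each function does: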
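To make the document-frequency rule concrete, here is a minimal in-memory sketch in plain Java with no Hadoop involved. The class name TfIdfSketch and both method names are hypothetical, and the weight uses the textbook formula tf * ln(N / df); it is an assumption for illustration, and the weighting Mahout actually applies may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
final class TfIdfSketch {

    // Each document is a map wordId -> term frequency.
    // A term adds exactly 1 to its df per document, however often it occurs there.
    static Map<Integer, Integer> documentFrequency(List<Map<Integer, Integer>> tfVectors) {
        Map<Integer, Integer> df = new HashMap<>();
        for (Map<Integer, Integer> doc : tfVectors) {
            for (Integer wordId : doc.keySet()) {
                df.merge(wordId, 1, Integer::sum);
            }
        }
        return df;
    }

    // Textbook TF-IDF weight (assumed formula, not necessarily Mahout's).
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }
}
```

With two documents, a term that appears in both has df = 2 and therefore a weight of 0 under this formula, regardless of its term frequency.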
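The chunking step can be sketched in memory as follows. This is only an illustration of the idea, not Mahout's code: the class DfChunkingSketch, the fixed per-entry size, and the use of key -1 as the slot holding the document count are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
final class DfChunkingSketch {
    // Assumed sentinel key under which the total document count is stored.
    static final int DOC_COUNT_KEY = -1;
    // Assumed size of one <int wordId, long count> record, ignoring file overhead.
    static final int BYTES_PER_ENTRY = Integer.BYTES + Long.BYTES;

    static final class Result {
        final List<Map<Integer, Long>> chunks = new ArrayList<>();
        long featureCount;   // number of terms (largest word id + 1)
        long documentCount;  // number of documents
    }

    static Result split(Map<Integer, Long> dfCounts, long chunkSizeLimitBytes) {
        Result result = new Result();
        Map<Integer, Long> current = new LinkedHashMap<>();
        long currentBytes = 0;
        long maxWordId = -1;
        for (Map.Entry<Integer, Long> e : dfCounts.entrySet()) {
            int key = e.getKey();
            if (key == DOC_COUNT_KEY) {
                result.documentCount = e.getValue();
                continue;
            }
            // Start a new chunk when adding this entry would exceed the limit.
            if (currentBytes + BYTES_PER_ENTRY > chunkSizeLimitBytes && !current.isEmpty()) {
                result.chunks.add(current);
                current = new LinkedHashMap<>();
                currentBytes = 0;
            }
            current.put(key, e.getValue());
            currentBytes += BYTES_PER_ENTRY;
            maxWordId = Math.max(maxWordId, key);
        }
        if (!current.isEmpty()) {
            result.chunks.add(current);
        }
        result.featureCount = maxWordId + 1;
        return result;
    }
}
```

Each chunk plays the role of one frequency.file-n, and the two counters correspond to the pair of numbers returned alongside the chunk paths.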
public static Pair<Long[], List<Path>> calculateDF(Path input,
                                                   Path output,
                                                   Configuration baseConf,
                                                   int chunkSizeInMegabytes)
    throws IOException, InterruptedException, ClassNotFoundException {
  // Clamp the chunk size to the allowed range
  if (chunkSizeInMegabytes < MIN_CHUNKSIZE) {
    chunkSizeInMegabytes = MIN_CHUNKSIZE;
  } else if (chunkSizeInMegabytes > MAX_CHUNKSIZE) { // 10GB
    chunkSizeInMegabytes = MAX_CHUNKSIZE;
  }
  Path wordCountPath = new Path(output, WORDCOUNT_OUTPUT_FOLDER);
  // Run a MapReduce job that computes each term's document frequency and the
  // number of documents; the result is stored in the df-count directory
  startDFCounting(input, wordCountPath, baseConf);
  // Take the output of the previous step as input, split it into chunks,
  // and extract the document count and the term count
  return createDictionaryChunks(wordCountPath, output, baseConf, chunkSizeInMegabytes);
}
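The TF-IDF computation closely resembles the TF computation, as the code shows: it splits the large data set into smaller chunks, processes each chunk separately, and merges the results at the end.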
public static void processTfIdf(Path input,
                                Path output,
                                Configuration baseConf,
                                Pair<Long[], List<Path>> datasetFeatures,
                                int minDf,
                                long maxDF,
                                float normPower,
                                boolean logNormalize,
                                boolean sequentialAccessOutput,
                                boolean namedVector,
                                int numReducers)
    throws IOException, InterruptedException, ClassNotFoundException {
  Preconditions.checkArgument(normPower == PartialVectorMerger.NO_NORMALIZING || normPower >= 0,
      "If specified normPower must be nonnegative", normPower);
  Preconditions.checkArgument(normPower == PartialVectorMerger.NO_NORMALIZING
      || (normPower > 1 && !Double.isInfinite(normPower))
      || !logNormalize,
      "normPower must be > 1 and not infinite if log normalization is chosen", normPower);

  int partialVectorIndex = 0;
  List<Path> partialVectorPaths = Lists.newArrayList();
  // Get the paths of all frequency files
  List<Path> dictionaryChunks = datasetFeatures.getSecond();
  for (Path dictionaryChunk : dictionaryChunks) {
    Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
    partialVectorPaths.add(partialVectorOutputPath);
    makePartialVectors(input,
                       baseConf,
                       // number of distinct terms in the data set
                       datasetFeatures.getFirst()[0],
                       // number of vectors in the data set, i.e. the number of documents
                       datasetFeatures.getFirst()[1],
                       minDf,
                       maxDF,
                       dictionaryChunk,
                       partialVectorOutputPath,
                       sequentialAccessOutput,
                       namedVector);
  }

  Configuration conf = new Configuration(baseConf);
  // Use tfidf-vectors as the output path
  Path outputDir = new Path(output, DOCUMENT_VECTOR_OUTPUT_FOLDER);
  // Merge the partial tfidf vectors produced for the different chunks
  PartialVectorMerger.mergePartialVectors(partialVectorPaths,
                                          outputDir,
                                          baseConf,
                                          normPower,
                                          logNormalize,
                                          datasetFeatures.getFirst()[0].intValue(),
                                          sequentialAccessOutput,
                                          namedVector,
                                          numReducers);
  // Delete the intermediate partial vectors
  HadoopUtil.delete(conf, partialVectorPaths);
}
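The split-then-merge pattern of processTfIdf() can be illustrated with a small in-memory sketch. Everything here is hypothetical (the class PartialVectorSketch, its method names, and the textbook weight tf * ln(N / df)); the minDf/maxDF filtering and the normalization options are deliberately left out, so this is a simplified picture rather than Mahout's actual behavior.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Mahout.
final class PartialVectorSketch {

    // Textbook TF-IDF weight (assumed formula).
    static double weight(int tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Build one partial vector per document, covering only the terms
    // whose document frequency is listed in this dictionary chunk.
    static Map<String, Map<Integer, Double>> makePartial(
            Map<String, Map<Integer, Integer>> tfVectors,
            Map<Integer, Long> dictionaryChunk,
            long numDocs) {
        Map<String, Map<Integer, Double>> partial = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> doc : tfVectors.entrySet()) {
            Map<Integer, Double> vector = new TreeMap<>();
            for (Map.Entry<Integer, Integer> term : doc.getValue().entrySet()) {
                Long df = dictionaryChunk.get(term.getKey());
                if (df != null) {
                    vector.put(term.getKey(), weight(term.getValue(), df, numDocs));
                }
            }
            partial.put(doc.getKey(), vector);
        }
        return partial;
    }

    // Merge the partial vectors of every chunk into one vector per document.
    static Map<String, Map<Integer, Double>> merge(
            List<Map<String, Map<Integer, Double>>> partials) {
        Map<String, Map<Integer, Double>> merged = new TreeMap<>();
        for (Map<String, Map<Integer, Double>> partial : partials) {
            for (Map.Entry<String, Map<Integer, Double>> doc : partial.entrySet()) {
                merged.computeIfAbsent(doc.getKey(), k -> new TreeMap<>())
                      .putAll(doc.getValue());
            }
        }
        return merged;
    }
}
```

Because each term id belongs to exactly one dictionary chunk, the partial vectors of a document never overlap, so merging them is a plain union of their entries.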
Next, we will analyze in detail how each of these functions works.
