Custom grouping classes in MapReduce (Part 3)
The Job class
/**
 * Define the comparator that controls which keys are grouped together
 * for a single call to
 * {@link Reducer#reduce(Object, Iterable,
 *                       org.apache.hadoop.mapreduce.Reducer.Context)}
 * @param cls the raw comparator to use
 * @throws IllegalStateException if the job is submitted
 * @see #setCombinerKeyGroupingComparatorClass(Class)
 */
public void setGroupingComparatorClass(Class<? extends RawComparator> cls)
    throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setOutputValueGroupingComparator(cls);
}
The JobConf class
The call above lands in the setOutputValueGroupingComparator method of JobConf:
/**
 * Set the user defined {@link RawComparator} comparator for
 * grouping keys in the input to the reduce.
 *
 * <p>This comparator should be provided if the equivalence rules for keys
 * for sorting the intermediates are different from those for grouping keys
 * before each call to
 * {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.</p>
 *
 * <p>For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
 * in a single call to the reduce function if K1 and K2 compare as equal.</p>
 *
 * <p>Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
 * how keys are sorted, this can be used in conjunction to simulate
 * <i>secondary sort on values</i>.</p>
 *
 * <p><i>Note</i>: This is not a guarantee of the reduce sort being
 * <i>stable</i> in any sense. (In any case, with the order of available
 * map-outputs to the reduce being non-deterministic, it wouldn't make
 * that much sense.)</p>
 *
 * @param theClass the comparator class to be used for grouping keys.
 *                 It should implement <code>RawComparator</code>.
 * @see #setOutputKeyComparatorClass(Class)
 * @see #setCombinerKeyGroupingComparator(Class)
 */
public void setOutputValueGroupingComparator(
    Class<? extends RawComparator> theClass) {
  setClass(JobContext.GROUP_COMPARATOR_CLASS,
           theClass, RawComparator.class);
}
Press Ctrl+O to bring up the class outline
and locate getOutputValueGroupingComparator:
/**
 * Get the user defined {@link WritableComparable} comparator for
 * grouping keys of inputs to the reduce.
 *
 * @return comparator set by the user for grouping values.
 * @see #setOutputValueGroupingComparator(Class) for details.
 */
public RawComparator getOutputValueGroupingComparator() {
  Class<? extends RawComparator> theClass = getClass(
    JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
  if (theClass == null) {
    return getOutputKeyComparator();
  }
  return ReflectionUtils.newInstance(theClass, this);
}
So who calls getOutputValueGroupingComparator?
The ReduceTask class
In ReduceTask:
(No comparator field is declared here; the return value is simply received into a local variable.)
RawComparator comparator = job.getOutputValueGroupingComparator();
When a grouping class has been set on the job, the comparator obtained here is our custom xxxG.
Next, look for where comparator is used:
if (useNewApi) {
  runNewReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
} else {
  runOldReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
}
There are two branches because Hadoop has both an old and a new API.
We follow the new one, so look at runNewReducer:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reporter.setProgress(rawIter.getProgress().getProgress());
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
        getTaskID(), reporter);
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
    (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
    new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
  job.setBoolean("mapred.skip.on", isSkipping());
  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  org.apache.hadoop.mapreduce.Reducer.Context
       reducerContext = createReduceContext(reducer, job, getTaskID(),
                                             rIter, reduceInputKeyCounter,
                                             reduceInputValueCounter,
                                             trackedRW,
                                             committer,
                                             reporter, comparator, keyClass,
                                             valueClass);
  try {
    reducer.run(reducerContext);
  } finally {
    trackedRW.close(reducerContext);
  }
}
runNewReducer receives the comparator argument and passes it straight on to createReduceContext.
The Task class
The createReduceContext method in Task:
@SuppressWarnings("unchecked")
protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
createReduceContext(org.apache.hadoop.mapreduce.Reducer
                      <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                    Configuration job,
                    org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                    RawKeyValueIterator rIter,
                    org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                    org.apache.hadoop.mapreduce.Counter inputValueCounter,
                    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                    org.apache.hadoop.mapreduce.OutputCommitter committer,
                    org.apache.hadoop.mapreduce.StatusReporter reporter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass, Class<INVALUE> valueClass
) throws IOException, InterruptedException {
  org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
  reduceContext =
    new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
                                                            rIter,
                                                            inputKeyCounter,
                                                            inputValueCounter,
                                                            output,
                                                            committer,
                                                            reporter,
                                                            comparator,
                                                            keyClass,
                                                            valueClass);
The ReduceContextImpl class
In ReduceContextImpl, the constructor stores the comparator in a field:
public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
                         RawKeyValueIterator input,
                         Counter inputKeyCounter,
                         Counter inputValueCounter,
                         RecordWriter<KEYOUT,VALUEOUT> output,
                         OutputCommitter committer,
                         StatusReporter reporter,
                         RawComparator<KEYIN> comparator,
                         Class<KEYIN> keyClass,
                         Class<VALUEIN> valueClass
                        ) throws InterruptedException, IOException {
  super(conf, taskid, output, committer, reporter);
  this.input = input;
  this.inputKeyCounter = inputKeyCounter;
  this.inputValueCounter = inputValueCounter;
  this.comparator = comparator;
  this.serializationFactory = new SerializationFactory(conf);
  this.keyDeserializer = serializationFactory.getDeserializer(keyClass);
  this.keyDeserializer.open(buffer);
  this.valueDeserializer = serializationFactory.getDeserializer(valueClass);
  this.valueDeserializer.open(buffer);
  hasMore = input.next();
  this.keyClass = keyClass;
  this.valueClass = valueClass;
  this.conf = conf;
  this.taskid = taskid;
}
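Searching for comparator inside ReduceContextImpl leads to nextKeyValue, where it decides whether the next key belongs to the same group: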
/**
 * Advance to the next key/value pair.
 */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  if (!hasMore) {
    key = null;
    value = null;
    return false;
  }
  firstValue = !nextKeyIsSame;
  DataInputBuffer nextKey = input.getKey();
  currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
                    nextKey.getLength() - nextKey.getPosition());
  buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
  key = keyDeserializer.deserialize(key);
  DataInputBuffer nextVal = input.getValue();
  buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
      - nextVal.getPosition());
  value = valueDeserializer.deserialize(value);

  currentKeyLength = nextKey.getLength() - nextKey.getPosition();
  currentValueLength = nextVal.getLength() - nextVal.getPosition();

  if (isMarked) {
    backupStore.write(nextKey, nextVal);
  }

  hasMore = input.next();
  if (hasMore) {
    nextKey = input.getKey();
    nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                   currentRawKey.getLength(),
                                   nextKey.getData(),
                                   nextKey.getPosition(),
                                   nextKey.getLength() - nextKey.getPosition()
                                       ) == 0;
  } else {
    nextKeyIsSame = false;
  }
  inputValueCounter.increment(1);
  return true;
}
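The boundary logic in nextKeyValue can be easier to see in isolation. The following is a minimal sketch using only the JDK, not Hadoop code: it assumes keys are ordinary objects rather than raw bytes, and the class name GroupBoundaryModel and the "prefix#suffix" key format are made up for illustration. It walks an already-sorted key list and closes a group whenever the comparator reports that the next key is different, which is the role nextKeyIsSame plays above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of how group boundaries are found on the reduce side.
// Not Hadoop code: keys are plain objects here rather than raw bytes.
class GroupBoundaryModel {

    // Splits an already-sorted key list into groups. A group ends when the
    // grouping comparator says the next key differs from the current one,
    // mirroring the nextKeyIsSame flag in ReduceContextImpl.
    static <K> List<List<K>> group(List<K> sortedKeys, Comparator<K> groupingComparator) {
        List<List<K>> groups = new ArrayList<>();
        List<K> current = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i++) {
            K key = sortedKeys.get(i);
            current.add(key);
            boolean hasMore = i + 1 < sortedKeys.size();
            boolean nextKeyIsSame = hasMore
                    && groupingComparator.compare(key, sortedKeys.get(i + 1)) == 0;
            if (!nextKeyIsSame) {
                groups.add(current);          // one reduce() call would end here
                current = new ArrayList<>();
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a#1", "a#2", "b#1", "b#7", "c#3");
        // Hypothetical grouping rule: compare only the part before '#'.
        Comparator<String> byPrefix =
                Comparator.comparing((String k) -> k.substring(0, k.indexOf('#')));
        System.out.println(group(keys, byPrefix)); // [[a#1, a#2], [b#1, b#7], [c#3]]
    }
}
```

Note that only adjacent keys are compared, so keys that should fall into one group must already sit next to each other after sorting.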
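The compare method called in nextKeyValue is the one declared in the RawComparator interface:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
Common key types such as Text and IntWritable supply an implementation of it through their nested Comparator classes.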
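To make the six parameters concrete, here is a small stand-alone sketch of a comparison with the same parameter shape. It does not use any Hadoop class; it assumes each key is a 4-byte big-endian int (the layout DataOutput.writeInt produces), and the class name RawIntCompareSketch is invented for this example.

```java
import java.nio.ByteBuffer;

// Stand-alone illustration of the raw compare signature, without Hadoop.
// Assumption: every key is a 4-byte big-endian int.
class RawIntCompareSketch {

    // Each key is described as a slice (start offset, length) of a byte
    // array, so the comparison works directly on serialized bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int left = ByteBuffer.wrap(b1, s1, l1).getInt();
        int right = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(left, right);
    }

    public static void main(String[] args) {
        // Two keys stored back to back in one buffer: 42 at offset 0, 7 at offset 4.
        byte[] buf = ByteBuffer.allocate(8).putInt(42).putInt(7).array();
        System.out.println(compare(buf, 0, 4, buf, 4, 4) > 0); // true, since 42 > 7
    }
}
```

The offsets matter because both keys may live in the same underlying buffer, as they do in the call inside nextKeyValue, where the next key is read from its own position in the input buffer.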
(1) No grouping comparator set
if(theClass == null){return getOutputKeyComparator();}
/**
 * Get the {@link RawComparator} comparator used to compare keys.
 *
 * @return the {@link RawComparator} comparator used to compare keys.
 */
public RawComparator getOutputKeyComparator() {
  Class<? extends RawComparator> theClass = getClass(
    JobContext.KEY_COMPARATOR, null, RawComparator.class);
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
  return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
}
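Without job.setGroupingComparatorClass(xxxG.class), the default applies: grouping falls back to getOutputKeyComparator(), which returns the configured key comparator if there is one and otherwise the comparator for the map output key class, for example the one for Text.
So by default, grouping uses the same comparator as sorting (more precisely, its compare method).
(That comparator comes in two forms:
1 the compareTo method of the key's class
2 the compare method of a custom comparator class
)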
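The lookup order described above can be summarized in a few lines. This is a simplified model using only the JDK, not the Hadoop implementation: java.util.Comparator stands in for RawComparator, the two parameters stand in for the configured grouping and sort comparators, and the name ComparatorFallbackModel is invented for illustration.

```java
import java.util.Comparator;

// Simplified model of the fallback order, not Hadoop code.
class ComparatorFallbackModel {

    // A null argument means "not configured on the job".
    static <K extends Comparable<K>> Comparator<K> resolveGrouping(
            Comparator<K> groupingComparator, Comparator<K> sortComparator) {
        if (groupingComparator != null) {
            return groupingComparator;      // a grouping class was set explicitly
        }
        if (sortComparator != null) {
            return sortComparator;          // form 2: custom comparator class
        }
        return Comparator.naturalOrder();   // form 1: the key class's compareTo
    }

    public static void main(String[] args) {
        // Nothing configured: "a" and "A" are different keys under compareTo.
        System.out.println(
                ComparatorFallbackModel.<String>resolveGrouping(null, null)
                        .compare("a", "A") == 0);   // false
        // Only a sort comparator configured: grouping inherits its rule.
        System.out.println(
                resolveGrouping(null, String.CASE_INSENSITIVE_ORDER)
                        .compare("a", "A") == 0);   // true
    }
}
```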
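Whichever of 1 or 2 we use, grouping and sorting then both use 1 or both use 2, so they follow one and the same rule. That is not suitable when keys need to be sorted on more fields than they are grouped on.
So in practice we usually define a custom grouping class alongside the custom comparator class.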
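A typical reason to separate the two rules is a secondary sort. The sketch below models the idea with plain JDK collections; it is not Hadoop code, and the YearTemp key, the SORT and GROUP comparators, and the sample data are all hypothetical. Records are sorted by year and then by temperature descending, but grouped by year only, so the first key in each group carries that year's maximum temperature.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK model of a secondary sort; not Hadoop code.
class SecondarySortModel {

    // Hypothetical composite key: a year plus a temperature reading.
    static final class YearTemp {
        final int year;
        final int temp;
        YearTemp(int year, int temp) {
            this.year = year;
            this.temp = temp;
        }
    }

    // Sort rule: year ascending, then temperature descending.
    static final Comparator<YearTemp> SORT =
            Comparator.comparingInt((YearTemp k) -> k.year)
                    .thenComparing(Comparator.comparingInt((YearTemp k) -> k.temp).reversed());

    // Grouping rule: year only, so all readings of a year form one group.
    static final Comparator<YearTemp> GROUP = Comparator.comparingInt((YearTemp k) -> k.year);

    // Keeps the first key of every group, which is the year's maximum.
    static Map<Integer, Integer> maxTempPerYear(List<YearTemp> mapOutput) {
        List<YearTemp> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        YearTemp previous = null;
        for (YearTemp key : sorted) {
            boolean startsNewGroup = previous == null || GROUP.compare(previous, key) != 0;
            if (startsNewGroup) {
                result.put(key.year, key.temp);
            }
            previous = key;
        }
        return result;
    }

    public static void main(String[] args) {
        List<YearTemp> records = new ArrayList<>();
        records.add(new YearTemp(1950, 0));
        records.add(new YearTemp(1949, 111));
        records.add(new YearTemp(1950, 22));
        records.add(new YearTemp(1949, 78));
        System.out.println(maxTempPerYear(records)); // {1949=111, 1950=22}
    }
}
```

If grouping reused the sort rule here, every distinct (year, temp) pair would become its own group and the reducer would never see a year's readings together.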
(2) Grouping comparator set
return ReflectionUtils.newInstance(theClass, this);
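If we call job.setGroupingComparatorClass(xxxG.class), an instance of our custom grouping class xxxG is created by reflection.
xxxG must implement RawComparator; the usual way is to extend WritableComparator and override its compare method.
For example:
public static class SelfGroupComparator extends WritableComparator{
Overriding compare inside this class is all that is needed.
From there on, the grouping comparator is invoked in exactly the same way as the sort comparator's compare.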
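As a rough idea of what an overridden compare might do for grouping, here is a JDK-only sketch. It is not a WritableComparator subclass and is not taken from the original post: it assumes a hypothetical composite key serialized as two 4-byte ints (id, then score) and compares only the id bytes, so keys with the same id but different scores count as equal. In a real job, logic of this kind would sit in the compare method of a class such as SelfGroupComparator.

```java
import java.nio.ByteBuffer;

// JDK-only sketch of a grouping rule over serialized keys; not Hadoop code.
// Assumed key layout: 4-byte id followed by 4-byte score.
class PrefixGroupingSketch {

    static final int ID_BYTES = 4;

    // Serializes the hypothetical composite key as two big-endian ints.
    static byte[] serialize(int id, int score) {
        return ByteBuffer.allocate(8).putInt(id).putInt(score).array();
    }

    // Compares only the leading id bytes and ignores the score that follows.
    // Bytes are compared as unsigned values, which does not match signed int
    // order for negative ids, but grouping only tests whether the result is 0.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n1 = Math.min(l1, ID_BYTES);
        int n2 = Math.min(l2, ID_BYTES);
        for (int i = 0; i < Math.min(n1, n2); i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return n1 - n2;
    }

    public static void main(String[] args) {
        byte[] k1 = serialize(7, 100);
        byte[] k2 = serialize(7, 5);
        byte[] k3 = serialize(8, 5);
        System.out.println(compare(k1, 0, k1.length, k2, 0, k2.length) == 0); // true: same id
        System.out.println(compare(k2, 0, k2.length, k3, 0, k3.length) == 0); // false: different id
    }
}
```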
I recommend approach 2.
Alt+Left Arrow jumps back to the place in the source you were viewing before.

