Custom comparable key classes and comparator classes in MapReduce (Part 2): a beginner's walk through the source to see how they work
The Job class
    /**
     * Define the comparator that controls
     * how the keys are sorted before they
     * are passed to the {@link Reducer}.
     * @param cls the raw comparator
     * @see #setCombinerKeyGroupingComparatorClass(Class)
     */
    public void setSortComparatorClass(Class<? extends RawComparator> cls)
        throws IllegalStateException {
      ensureState(JobState.DEFINE);
      conf.setOutputKeyComparatorClass(cls);
    }
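In other words, this method sets the comparator that decides how keys are sorted before they are passed to the Reducer.
<? extends RawComparator>
is an upper-bounded wildcard: the class passed in must be RawComparator itself or a subtype of it (a sub-interface or an implementing class).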
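To make the wildcard concrete, here is a minimal JDK-only sketch. `RawCmp`, `ByteLengthCmp` and `WildcardDemo` are hypothetical stand-ins I made up for illustration; they are not Hadoop classes. The point is only that a `Class<? extends X>` parameter accepts the class token of `X` or of any subtype, and the compiler rejects everything else.

```java
import java.util.Comparator;

// Hypothetical stand-in for RawComparator (not the Hadoop interface).
interface RawCmp extends Comparator<byte[]> {}

// Hypothetical stand-in for a user-supplied comparator class.
class ByteLengthCmp implements RawCmp {
    @Override
    public int compare(byte[] a, byte[] b) {
        return Integer.compare(a.length, b.length);
    }
}

class WildcardDemo {
    // Accepts RawCmp.class or the Class object of any subtype of RawCmp.
    static String register(Class<? extends RawCmp> cls) {
        return cls.getSimpleName();
    }

    static String demo() {
        // register(String.class);  // would not compile: String is not a RawCmp
        return register(ByteLengthCmp.class);
    }
}
```

The commented-out line is the interesting part: the mistake is caught at compile time, before the job ever runs.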
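RawComparator
Interface Comparator
  -- sub-interface RawComparator: Compare two objects in binary.
     Its compare method:
     public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
  -- implementing class WritableComparator
Since cls must be RawComparator or a subtype of it, a custom comparator class that extends WritableComparator is acceptable too.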
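What "compare two objects in binary" means is easier to see with a small example. The sketch below is JDK-only and the class name `RawIntCompareDemo` is hypothetical; it only borrows the parameter shape of `compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)` (buffer, start offset, length) and assumes the keys are 4-byte big-endian ints.

```java
import java.nio.ByteBuffer;

class RawIntCompareDemo {
    // Same parameter shape as RawComparator.compare: buffer, start offset, length.
    // Assumption: each key is a 4-byte big-endian int.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v1, v2);
    }

    static int demo() {
        // Two ints, 7 and 42, serialized back to back into one buffer.
        byte[] buf = ByteBuffer.allocate(8).putInt(7).putInt(42).array();
        // Compare the key at offset 0 with the key at offset 4.
        return compare(buf, 0, 4, buf, 4, 4);
    }
}
```

Both keys live in the same byte array and are addressed by offset and length, so no key object has to be created just to compare them.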
The JobConf class
Clicking through setOutputKeyComparatorClass takes us to the JobConf class:
    /**
     * Set the {@link RawComparator} comparator used to compare keys.
     *
     * @param theClass the {@link RawComparator} comparator used to
     *                 compare keys.
     * @see #setOutputValueGroupingComparator(Class)
     */
    public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass) {
      setClass(JobContext.KEY_COMPARATOR, theClass, RawComparator.class);
    }
This sets the comparator used to compare keys; the theClass parameter is that comparator class.
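    setClass(JobContext.KEY_COMPARATOR, theClass, RawComparator.class);
About setClass, its Javadoc says:
     * An exception is thrown if <code>theClass</code> does not implement the
     * interface <code>xface</code>.
So setClass stores theClass in the configuration under the property named by JobContext.KEY_COMPARATOR, after checking that theClass is RawComparator itself or a subtype of it. If it is not, an exception is thrown. In other words, theClass must implement RawComparator.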
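The check-then-store behaviour can be sketched with plain JDK classes. `MiniConf` below is a hypothetical miniature I wrote for illustration, not Hadoop's Configuration; it only assumes that the class is validated against the interface first and then stored by name under the given property key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a configuration object (not Hadoop's Configuration).
class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    // Store theClass under `name`, but only if it is assignable to xface.
    void setClass(String name, Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        props.put(name, theClass.getName());
    }

    String get(String name) {
        return props.get(name);
    }

    // Helper for trying it out: true if setClass refuses the combination.
    static boolean rejects(Class<?> theClass, Class<?> xface) {
        try {
            new MiniConf().setClass("some.key", theClass, xface);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }
}
```

With this sketch, `MiniConf.rejects(String.class, java.util.Comparator.class)` is true because String is not a Comparator, while `MiniConf.rejects(String.class, CharSequence.class)` is false.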
Where there is a setOutputKeyComparatorClass there should be a matching getOutputKeyComparator, and it is indeed in the JobConf class as well:
It obtains and returns the comparator used to compare keys; the return type is RawComparator.
    /**
     * Get the {@link RawComparator} comparator used to compare keys.
     *
     * @return the {@link RawComparator} comparator used to compare keys.
     */
    public RawComparator getOutputKeyComparator() {
      // Look up the class stored under KEY_COMPARATOR; null if nothing was set.
      Class<? extends RawComparator> theClass = getClass(
          JobContext.KEY_COMPARATOR, null, RawComparator.class);
      // If a comparator class was set, create an instance of it via reflection.
      if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
      // Otherwise, use the default comparator for the map output key class.
      return WritableComparator.get(
          getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
    }
    if (theClass != null)
      return ReflectionUtils.newInstance(theClass, this);
This branch is taken when we have specified a comparator class, that is, when we called job.setSortComparatorClass(xxxS.class), where xxxS extends WritableComparator and overrides its compare method.
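The "configured class first, default otherwise" logic can be imitated without Hadoop. Everything in the sketch below is hypothetical (the property name `demo.key.comparator`, the classes `ReverseStringCmp` and `ComparatorLookup`), and it uses a plain Map in place of the job configuration and the natural String ordering in place of the default comparator.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical user-supplied comparator: reverse String order.
class ReverseStringCmp implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        return b.compareTo(a);
    }
}

class ComparatorLookup {
    static final String KEY = "demo.key.comparator"; // hypothetical property name

    @SuppressWarnings("unchecked")
    static Comparator<String> lookup(Map<String, String> conf) {
        String className = conf.get(KEY);
        if (className != null) {
            // A class was configured: instantiate it via reflection.
            try {
                Class<?> cls = Class.forName(className);
                return (Comparator<String>) cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        // Nothing configured: fall back to the natural ordering of the key type.
        return Comparator.naturalOrder();
    }

    static int withConfigured() {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, ReverseStringCmp.class.getName());
        return lookup(conf).compare("a", "b");
    }

    static int withDefault() {
        return lookup(new HashMap<>()).compare("a", "b");
    }
}
```

`withConfigured()` returns a positive number (reverse order puts "a" after "b"), and `withDefault()` a negative one.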
The MapTask$MapOutputBuffer class
At this point one question remains (for the obsessive among us): who actually calls getOutputKeyComparator?
MapTask has an inner class, MapOutputBuffer.
Field: private RawComparator<K> comparator;
Where the field is assigned:
// k/v serialization
comparator = job.getOutputKeyComparator();
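So the getter is called, and the field assigned, in the part of MapOutputBuffer's setup that prepares key/value serialization.
(Editor tip: Ctrl+Shift+P jumps to the matching bracket.)
Method: compare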
    /**
     * Compare logical range, st i, j MOD offset capacity.
     * Compare by partition, then by key.
     * @see IndexedSortable#compare
     */
    public int compare(final int mi, final int mj) {
      final int kvi = offsetFor(mi % maxRec);
      final int kvj = offsetFor(mj % maxRec);
      final int kvip = kvmeta.get(kvi + PARTITION);
      final int kvjp = kvmeta.get(kvj + PARTITION);
      // sort by partition
      if (kvip != kvjp) {
        return kvip - kvjp;
      }
      // sort by key
      return comparator.compare(kvbuffer,
          kvmeta.get(kvi + KEYSTART),
          kvmeta.get(kvi + VALSTART) - kvmeta.get(kvi + KEYSTART),
          kvbuffer,
          kvmeta.get(kvj + KEYSTART),
          kvmeta.get(kvj + VALSTART) - kvmeta.get(kvj + KEYSTART));
    }
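The "by partition, then by key" rule in this method can be tried out on its own. The sketch below is JDK-only; the `Rec` class is a hypothetical simplification, since the real buffer keeps partition numbers and key offsets in an int buffer rather than in objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PartitionThenKeyDemo {
    // Hypothetical record type, standing in for one entry of the map output buffer.
    static final class Rec {
        final int partition;
        final String key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    static int compare(Rec a, Rec b, Comparator<String> keyComparator) {
        // sort by partition first
        if (a.partition != b.partition) {
            return a.partition - b.partition;
        }
        // same partition: let the key comparator decide
        return keyComparator.compare(a.key, b.key);
    }

    static String sortedKeys() {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec(1, "a"));
        recs.add(new Rec(0, "z"));
        recs.add(new Rec(0, "b"));
        recs.sort((x, y) -> compare(x, y, Comparator.<String>naturalOrder()));
        StringBuilder sb = new StringBuilder();
        for (Rec r : recs) {
            sb.append(r.partition).append(r.key).append(' ');
        }
        return sb.toString().trim();
    }
}
```

`sortedKeys()` gives `0b 0z 1a`: the key comparator only matters among records of the same partition, which is why "a" in partition 1 still sorts last.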
And RawComparator declares:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
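So when we pass in xxxS, a subclass of WritableComparator, the method invoked here is the byte-array compare that xxxS inherits from WritableComparator. WritableComparator also has another, overloaded compare method.
Here is that byte-array compare in WritableComparator: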
    /** Optimization hook.  Override this to make SequenceFile.Sorter's scream.
     *
     * <p>The default implementation reads the data into two {@link
     * WritableComparable}s (using {@link
     * Writable#readFields(DataInput)}, then calls {@link
     * #compare(WritableComparable,WritableComparable)}.
     */
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      try {
        buffer.reset(b1, s1, l1);   // parse key1
        key1.readFields(buffer);

        buffer.reset(b2, s2, l2);   // parse key2
        key2.readFields(buffer);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
      return compare(key1, key2);   // compare them
    }
As far as I can tell, the first part deserializes the two keys, key1 and key2, from the byte arrays.
What finally gets called is: compare(key1, key2);
    /** Compare two WritableComparables.
     *
     * <p> The default implementation uses the natural ordering, calling {@link
     * Comparable#compareTo(Object)}. */
    @SuppressWarnings("unchecked")
    public int compare(WritableComparable a, WritableComparable b) {
      return a.compareTo(b);
    }
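At this point the call reaches compareTo on the key object, and that is the method we implemented, provided the comparator subclass has not overridden compare(WritableComparable, WritableComparable) itself.
(Our custom key class implements the WritableComparable interface and implements compareTo.)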
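The whole chain (bytes, then deserialized keys, then compareTo) fits in a short JDK-only sketch. `IntKey` and `DeserializeCompareDemo` are hypothetical names; unlike WritableComparator, which reuses its key1, key2 and buffer fields, this sketch creates fresh objects on every call to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DeserializeCompareDemo {
    // Hypothetical key type; in Hadoop it would implement WritableComparable.
    static final class IntKey implements Comparable<IntKey> {
        int value;

        void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }

        @Override
        public int compareTo(IntKey o) {
            return Integer.compare(value, o.value);
        }
    }

    // Object-level compare: defaults to the key's natural ordering.
    static int compare(IntKey a, IntKey b) {
        return a.compareTo(b);
    }

    // Byte-level compare: deserialize both keys, then delegate.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        IntKey key1 = new IntKey();
        IntKey key2 = new IntKey();
        try {
            key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
            key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return compare(key1, key2);
    }

    static int demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(5);   // first key
            out.writeInt(3);   // second key
            byte[] buf = bytes.toByteArray();
            return compare(buf, 0, 4, buf, 4, 4);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

`demo()` returns a positive number because 5 sorts after 3. The sort itself only ever sees bytes; compareTo runs because the byte-level method delegates to it.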
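One more point: didn't we say earlier that anything passed to setSortComparatorClass must be RawComparator or a subtype of it?
(1)
Suppose we write a custom key class, keyxxxS, that implements the WritableComparable interface and implements compareTo.
In that case no set call is needed.
getOutputKeyComparator then falls through to: return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);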
    /**
     * Get the key class for the map output data. If it is not set, use the
     * (final) output key class. This allows the map output key class to be
     * different than the final output key class.
     *
     * @return the map output key class.
     */
    public Class<?> getMapOutputKeyClass() {
      Class<?> retv = getClass(JobContext.MAP_OUTPUT_KEY_CLASS, null, Object.class);
      if (retv == null) {
        retv = getOutputKeyClass();
      }
      return retv;
    }
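As the name suggests, this returns the map output key class, that is, the one passed to job.setMapOutputKeyClass(xxx.class), such as Text or our custom keyxxxS.
How to write the custom key class keyxxxS
Declaration of the WritableComparable interface: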
public interface WritableComparable<T> extends Writable,Comparable<T>
The Javadoc says:
    /**
     * A serializable object which implements a simple, efficient, serialization
     * protocol, based on {@link DataInput} and {@link DataOutput}.
     *
     * <p>Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
     * framework implements this interface.</p>
In other words, the types used for keys and values are expected to implement this interface. Text and IntWritable are examples; the source of Text confirms it:
    public class Text extends BinaryComparable
        implements WritableComparable<BinaryComparable> {}
     * <p>Implementations typically implement a static <code>read(DataInput)</code>
     * method which constructs a new instance, calls {@link #readFields(DataInput)}
     * and returns the instance.</p>
That is, an implementing class usually provides a static read method that builds a new instance, calls readFields on it, and returns it.
Below is the complete example given in the Javadoc:
Example:
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
      // Some data
      private int counter;
      private long timestamp;

      public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
      }

      public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
      }

      public int compareTo(MyWritableComparable o) {
        // The Javadoc writes this.value and o.value here, but the class has no
        // such field; counter is used instead so that the example compiles.
        int thisValue = this.counter;
        int thatValue = o.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
      }

      public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
      }
    }
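One thing the example relies on without saying so: readFields must read the fields in exactly the order write wrote them. Below is a JDK-only round-trip sketch to check that. `MyKey` and `RoundTripDemo` are hypothetical names, the class does not implement the Hadoop interface, and ordering by counter first and timestamp second is my own assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class RoundTripDemo {
    // Hypothetical key with the same two fields as the Javadoc example.
    static final class MyKey implements Comparable<MyKey> {
        int counter;
        long timestamp;

        MyKey(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        // Reads the fields in the same order write() wrote them.
        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Assumed ordering: counter first, then timestamp.
        @Override
        public int compareTo(MyKey o) {
            int c = Integer.compare(counter, o.counter);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }
    }

    static byte[] serialize(MyKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static MyKey deserialize(byte[] data) {
        try {
            MyKey key = new MyKey(0, 0L);
            key.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return key;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A key serialized this way takes 12 bytes (4 for the int, 8 for the long), and `deserialize(serialize(k)).compareTo(k)` is 0 for any key k.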
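(2)
If instead we write a custom comparator class xxxS, it extends WritableComparator and overrides the compare method,
and we must register it with job.setSortComparatorClass(xxxS.class).
(getOutputKeyComparator again returns a RawComparator implementation, this time an instance of xxxS. Its overridden compare decides the order; the key's compareTo only runs if that compare, or the inherited one, calls it.)
How to write the custom comparator class xxxS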
    /** A Comparator for {@link WritableComparable}s.
     *
     * <p>This base implementation uses the natural ordering.  To define alternate
     * orderings, override {@link #compare(WritableComparable,WritableComparable)}.
     *
     * <p>One may optimize compare-intensive operations by overriding
     * {@link #compare(byte[],int,int,byte[],int,int)}.  Static utility methods are
     * provided to assist in optimized implementations of this method.
     */
    public class WritableComparator implements RawComparator, Configurable
WritableComparator is a comparator for WritableComparable objects.
The base implementation sorts by natural ordering. To define a different ordering, override the compare method.
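"Natural ordering by default, override compare for something else" looks like this in miniature. The sketch is JDK-only: `NaturalCmp` is a hypothetical stand-in for the base class and `DescendingCmp` for a custom xxxS; neither is a Hadoop class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the base comparator: natural ordering by default.
class NaturalCmp<T extends Comparable<T>> implements Comparator<T> {
    @Override
    public int compare(T a, T b) {
        return a.compareTo(b);
    }
}

// Alternate ordering: override compare and swap the arguments to sort descending.
class DescendingCmp<T extends Comparable<T>> extends NaturalCmp<T> {
    @Override
    public int compare(T a, T b) {
        return super.compare(b, a);
    }
}

class AlternateOrderDemo {
    static List<Integer> sortWith(Comparator<Integer> cmp) {
        List<Integer> keys = new ArrayList<>(Arrays.asList(2, 3, 1));
        keys.sort(cmp);
        return keys;
    }
}
```

`AlternateOrderDemo.sortWith(new NaturalCmp<>())` yields [1, 2, 3] and `sortWith(new DescendingCmp<>())` yields [3, 2, 1]. Swapping the arguments is used instead of negating the result because negation would overflow if a compareTo ever returned Integer.MIN_VALUE.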