MapReduce 的类型与格式【编写最简单的mapreduce】(1)
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
能够从源码中看出为什么是这种类型:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
publicclassMapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{publicclassContextextendsMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{// ...}protectedvoidmap(KEYINkey,VALUEINvalue,Contextcontext)throwsIOException,InterruptedException{// ...}}publicclassReducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{publicclassContextextendsReducerContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{// ...}protectedvoidreduce(KEYINkey,Iterable<VALUEIN>values,Contextcontext)throwsIOException,InterruptedException{// ...}}context用来接收输出键值对,写出的方法是:假设有combiner :这里的 combiner就是默认的reducerpublicvoidwrite(KEYOUTkey,VALUEOUTvalue)throwsIOException,InterruptedExceptionmap: (K1, V1) → list(K2, V2) combiner: (K2, list(V2)) → list(K2, V2) reduce: (K2, list(V2)) → list(K3, V3)假设partitioner被使用:partition: (K2, V2) → integer(非常多时候仅仅取决于key 值被忽略来进行分区)
以及combiner 甚至partitioner让同样的key聚合到一起
publicabstractclassPartitioner<KEY,VALUE>{publicabstractintgetPartition(KEYkey,VALUEvalue,intnumPartitions);}一个实现类:public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}输入数据的类型是通过输入格式进行设定的。比如,对于TextlnputFormat ,它的键类型就是LongWritable 。而值类型就是Text 。其它的类型能够通过调用JobConf 中的方法来进行显式地设置。假设没有显式地设置。 中阔的类型将被默认设置为(终于的)输出类型,也就是LongWritable 和Text.综上所述,假设K2 与K3是同样类型,就不须要手工调用setMapOutputKeyClass,由于它将被自己主动设置每个步骤的输入和输出类型.一定非常奇怪,为什么不能从最初输入的类型推导出每个步骤的输入/输出类型呢?
原来Java 的泛型机制具有非常多限制,类型擦除导致了运行时类型并不一直可见.所以须要Hadoop 时不时地"提醒"一下。这也导致了可能在某些MapReduce 任务中出现不兼容的输入和输出类型,由于这些配置在编译时无法检查出来。与MapReduce 任务兼容的类型已经在以下列出。全部的类型不兼容将在任务真正运行的时候被发现,所以一个比較聪明的做法是在运行任务前先用少量的数据跑一次測试任务。以发现全部的类型不兼容问题。
Table 8-1. Configuration of MapReduce types in the new API
Property Job setter method Input types Intermediate types Output types K1V1K2V2K3V3Properties for configuring types: mapreduce.job.inputformat.classsetInputFormatClass()• • mapreduce.map.output.key.classsetMapOutputKeyClass()• mapreduce.map.output.value.classsetMapOutputValueClass()• mapreduce.job.output.key.classsetOutputKeyClass()• mapreduce.job.output.value.classsetOutputValueClass()• Properties that must be consistent with the types: mapreduce.job.map.classsetMapperClass()• • • • mapreduce.job.combine.classsetCombinerClass()• • mapreduce.job.partitioner.classsetPartitionerClass()• • mapreduce.job.output.key.comparator.classsetSortComparatorClass()• mapreduce.job.output.group.comparator.classsetGroupingComparatorClass()• mapreduce.job.reduce.classsetReducerClass()• • • • mapreduce.job.outputformat.classsetOutputFormatClass()• • Table 8-2. Configuration of MapReduce types in the old API
Property JobConf setter method Input types Intermediate types Output types K1V1K2V2K3V3Properties for configuring types: mapred.input.format.classsetInputFormat()• • mapred.mapoutput.key.classsetMapOutputKeyClass()• mapred.mapoutput.value.classsetMapOutputValueClass()• mapred.output.key.classsetOutputKeyClass()• mapred.output.value.classsetOutputValueClass()• Properties that must be consistent with the types: mapred.mapper.classsetMapperClass()• • • • mapred.map.runner.classsetMapRunnerClass()• • • • mapred.combiner.classsetCombinerClass()• • mapred.partitioner.classsetPartitionerClass()• • mapred.output.key.comparator.classsetOutputKeyComparatorClass()• mapred.output.value.groupfn.classsetOutputValueGroupingComparator()• mapred.reducer.classsetReducerClass()• • • • mapred.output.format.classsetOutputFormat()• •
一个最简单的hadoop mapreduce:运行方法:publicclassMinimalMapReduceextendsConfiguredimplementsTool{@Overridepublicintrun(String[]args)throwsException{if(args.length!=2){System.err.printf("Usage: %s [generic options] <input> <output>\n",getClass().getSimpleName());ToolRunner.printGenericCommandUsage(System.err);return-1;}Jobjob=newJob(getConf());job.setJarByClass(getClass());FileInputFormat.addInputPath(job,newPath(args[0]));FileOutputFormat.setOutputPath(job,newPath(args[1]));returnjob.waitForCompletion(true)?0:1;}publicstaticvoidmain(String[]args)throwsException{intexitCode=ToolRunner.run(newMinimalMapReduce(),args);System.exit(exitCode);}}hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output输出结果:0→0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591... 0→0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181... 135→0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821... 141→0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181... 270→0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001... 282→0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...改默认最简mapreduce等同于一下的程序:publicclassMinimalMapReduceWithDefaultsextendsConfiguredimplementsTool{@Overridepublicintrun(String[]args)throwsException{Jobjob=JobBuilder.parseInputAndOutput(this,getConf(),args);if(job==null){return-1;}job.setInputFormatClass(TextInputFormat.class);job.setMapperClass(Mapper.class);job.setMapOutputKeyClass(LongWritable.class);job.setMapOutputValueClass(Text.class);job.setPartitionerClass(HashPartitioner.class);job.setNumReduceTasks(1);job.setReducerClass(Reducer.class);job.setOutputKeyClass(LongWritable.class);job.setOutputValueClass(Text.class);job.setOutputFormatClass(TextOutputFormat.class);returnjob.waitForCompletion(true)?0:1;}publicstaticvoidmain(String[]args)throwsException{intexitCode=ToolRunner.run(newMinimalMapReduceWithDefaults(),args);System.exit(exitCode);}}那么。默认使用的mapreduce是:Mapper.classHashPartitioner.classReducer.class默认map代码,就是读取key value输出publicclassMapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{protectedvoidmap(KEYINkey,VALUEINvalue,Contextcontext)throwsIOException,InterruptedException{context.write((KEYOUT)key,(VALUEOUT)value);}}默认Partitioner:hash切割,默认仅仅有一个reducer因此我们这里仅仅有一个分区默认Reduce 输出传进来的数据:classHashPartitioner<K,V>extendsPartitioner<K,V>{publicintgetPartition(Kkey,Vvalue,intnumReduceTasks){return(key.hashCode()&Integer.MAX_VALUE)%numReduceTasks;}}publicclassReducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>{protectedvoidreduce(KEYINkey,Iterable<VALUEIN>values,ContextcontextContextcontext)throwsIOException,InterruptedException{for(VALUEINvalue:values){context.write((KEYOUT)key,(VALUEOUT)value);}}}由于什么都没做,仅仅是在map中读取了偏移量和value,分区使用的hash,一个reduce输出的便是我们上面看到的样子。相对于java api,hadoop流也有最简的mapreduce:等于以下的命令:%hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \-input input/ncdc/sample.txt \-output output \-mapper /bin/cat%hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \-input input/ncdc/sample.txt \-output output \-inputformat org.apache.hadoop.mapred.TextInputFormat \-mapper /bin/cat \-partitioner org.apache.hadoop.mapred.lib.HashPartitioner \-numReduceTasks 1 \-reducer org.apache.hadoop.mapred.lib.IdentityReducer \-outputformat org.apache.hadoop.mapred.TextOutputFormat-io text流操作的键与值一个文本文件流怎么知道哪里是一个记录的结束呢?流分隔符:一个流操作的程序能够改动输入的分隔符(用于将键与值从输入文件里分开而且传入mapper) 。默认情况下是Tab ,可是假设输入的键或值中本身有Tab 分隔符的话,最好将分隔符改动成其它符号。类似地,当map 和reduc e 将结果输出的时候, 也须要一个能够配置的分隔符选项。更进一步, 键能够不仅仅是每一条记录的第1 个字段,它能够是一条记录的前n 个字段(能够在stream.num.map.output.key.fields和stream.num.reduce.output.key.fields 中进行设置) 。而剩下的字段就是值。比方有一条记录是a 。b , C 。 且用逗号分隔,假设n 设为2 ,那么键就是a 、b 。而值就是c 。
Table 8-3. Streaming separator properties
Property name Type Default value Description stream.map.input.field.separatorString\tThe separator to use when passing the input key and value strings to the stream map process as a stream of bytes stream.map.output.field.separatorString\tThe separator to use when splitting the output from the stream map process into key and value strings for the map output stream.num.map.output.key.fieldsint1The number of fields separated by stream.map.output.field.separatorto treat as the map output keystream.reduce.input.field.separatorString\tThe separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes stream.reduce.output.field.separatorString\tThe separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output stream.num.reduce.output.key.fieldsint1The number of fields separated by stream.reduce.output.field.separatorto treat as the reduce output keymapreduce中分隔符使用的地方。在标准输入输出和map-reducer之间。
浙公网安备 33010602011771号