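MapReduce Library Classes
Besides letting developers write their own map and reduce functions, Hadoop ships a library of commonly used functions (mappers, reducers, and partitioners). These classes live in the org.apache.hadoop.mapred.lib package, which in version 1.2.1 contains one interface and a number of classes. The org.apache.hadoop.mapreduce.lib package contains a related library, and parts of the two overlap. The mapred package is the old API and the mapreduce package is the refactored new API, but both can be used.
The interface is:
| Interface | Description |
| --- | --- |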
| InputSampler.Sampler<K,V> | Interface to sample using an InputFormat. |
The classes are:
| Class | Description |
| --- | --- |
| BinaryPartitioner<V> | Partition BinaryComparable keys using a configurable part of the bytes array returned by BinaryComparable.getBytes(). |
| ChainMapper | The ChainMapper class allows to use multiple Mapper classes within a single Map task. |
| ChainReducer | The ChainReducer class allows to chain multiple Mapper classes after a Reducer within the Reducer task. |
| CombineFileInputFormat<K,V> | An abstract InputFormat that returns CombineFileSplit objects from the InputFormat.getSplits(JobConf, int) method. |
| CombineFileRecordReader<K,V> | A generic RecordReader that can hand out different recordReaders for each chunk in a CombineFileSplit. |
| CombineFileSplit | A sub-collection of input files. |
| DelegatingInputFormat<K,V> | An InputFormat that delegates behaviour of paths to multiple other InputFormats. |
| DelegatingMapper<K1,V1,K2,V2> | A Mapper that delegates behaviour of paths to multiple other mappers. |
| FieldSelectionMapReduce<K,V> | This class implements a mapper/reducer class that can be used to perform field selections in a manner similar to unix cut. |
| HashPartitioner<K2,V2> | Partition keys by their Object.hashCode(). |
| IdentityMapper<K,V> | Implements the identity function, mapping inputs directly to outputs. |
| IdentityReducer<K,V> | Performs no reduction, writing all input values directly to the output. |
| InputSampler<K,V> | Utility for collecting samples and writing a partition file for TotalOrderPartitioner. |
| InputSampler.IntervalSampler<K,V> | Sample from s splits at regular intervals. |
| InputSampler.RandomSampler<K,V> | Sample from random points in the input. |
| InputSampler.SplitSampler<K,V> | Samples the first n records from s splits. |
| InverseMapper<K,V> | A Mapper that swaps keys and values. |
| KeyFieldBasedComparator<K,V> | This comparator implementation provides a subset of the features provided by the Unix/GNU Sort. |
| KeyFieldBasedPartitioner<K2,V2> | Defines a way to partition keys based on certain key fields (also see KeyFieldBasedComparator). |
| LongSumReducer<K> | A Reducer that sums long values. |
| MultipleInputs | This class supports MapReduce jobs that have multiple input paths with a different InputFormat and Mapper for each path. |
| MultipleOutputFormat<K,V> | This abstract class extends the FileOutputFormat, allowing to write the output data to different output files. |
| MultipleOutputs | The MultipleOutputs class simplifies writing to additional outputs other than the job default output via the OutputCollector passed to the map() and reduce() methods of the Mapper and Reducer implementations. |
| MultipleSequenceFileOutputFormat<K,V> | This class extends the MultipleOutputFormat, allowing to write the output data to different output files in sequence file output format. |
| MultipleTextOutputFormat<K,V> | This class extends the MultipleOutputFormat, allowing to write the output data to different output files in Text output format. |
| MultithreadedMapRunner<K1,V1,K2,V2> | Multithreaded implementation of org.apache.hadoop.mapred.MapRunnable. |
| NLineInputFormat | NLineInputFormat which splits N lines of input as one split. |
| NullOutputFormat<K,V> | Consume all outputs and put them in /dev/null. |
| RegexMapper<K> | A Mapper that extracts text matching a regular expression. |
| TokenCountMapper<K> | A Mapper that maps text values into <token,freq> pairs. |
| TotalOrderPartitioner<K extends WritableComparable,V> | Partitioner effecting a total order by reading split points from an externally generated source. |
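
The HashPartitioner row above says only that keys are partitioned by Object.hashCode(). As a minimal sketch of what that means in practice, the plain-Java class below maps a key to a reducer index. It does not use any Hadoop classes; the class name HashPartitionSketch is hypothetical, and the masking expression is, to my understanding, the same one Hadoop's HashPartitioner uses.

```java
/** Plain-Java sketch of hash partitioning; not a Hadoop class. */
public class HashPartitionSketch {

    /** Maps a key to a partition index in the range [0, numReduceTasks). */
    public static int getPartition(Object key, int numReduceTasks) {
        // Clear the sign bit so that a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hello", "hadoop", "chain"}) {
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```

Equal keys always land in the same partition, which is what lets a single reducer see every value for a given key.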
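So far I have used the following classes; the remaining classes and the interface will be studied later.
1) ChainMapper and ChainReducer: several mappers can run inside a single map task, followed by a reducer, after which further mappers can run inside the reduce task. The two classes are used together when a job would otherwise need several MapReduce passes. Because the output of one stage is passed directly to the next instead of being written out between separate jobs, this approach can noticeably reduce disk I/O.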
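2) TokenCounterMapper: splits each input value into individual words (using Java's StringTokenizer) and emits each word with a count of 1. In the old mapred.lib package listed above the class is named TokenCountMapper; TokenCounterMapper is its name in the new API.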
3) InverseMapper: a mapper that swaps keys and values.
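
To make the data flow of the three items above concrete, here is a minimal in-memory sketch of a chain: a token-counting map step, a summing reduce step, and a key/value swap after the reduce. It is plain Java and uses no Hadoop classes; the class ChainSketch and its method names are hypothetical stand-ins that only model the behaviour of TokenCountMapper, LongSumReducer, and InverseMapper. In a real job the stages would instead be registered through ChainMapper.addMapper, ChainReducer.setReducer, and ChainReducer.addMapper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java model of a map / reduce / post-reduce map chain; no Hadoop classes are used. */
public class ChainSketch {

    /** Models TokenCountMapper: split a line with StringTokenizer and emit (token, 1). */
    public static List<Map.Entry<String, Long>> tokenCountMap(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1L));
        }
        return out;
    }

    /** Models LongSumReducer: group the pairs by key and sum their long values. */
    public static Map<String, Long> longSumReduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> sums = new TreeMap<>(); // keys kept in sorted order
        for (Map.Entry<String, Long> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return sums;
    }

    /** Models InverseMapper: swap the key and the value of one record. */
    public static <K, V> Map.Entry<V, K> inverseMap(Map.Entry<K, V> record) {
        return new SimpleEntry<>(record.getValue(), record.getKey());
    }

    /** Runs the whole chain in memory: map, then reduce, then a post-reduce map. */
    public static List<Map.Entry<Long, String>> run(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(tokenCountMap(line));
        }
        List<Map.Entry<Long, String>> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : longSumReduce(mapped).entrySet()) {
            result.add(inverseMap(entry));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one "count<TAB>word" line per distinct word.
        for (Map.Entry<Long, String> e : run(Arrays.asList("hello hadoop", "hello chain"))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Each stage consumes the previous stage's records directly, which is the property that lets the real ChainMapper and ChainReducer avoid writing intermediate results between separate jobs.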