MapReduce Design Patterns: In-mapper Combining

Background

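The Hadoop framework includes a Combiner/Partitioner stage. Its purpose is to reduce the cost of data transfer and synchronization between the map and reduce phases.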

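In practice, however, the Combiner is only an optimization hint. Hadoop applies it to map output while that output is being spilled and merged, and it gives no guarantee about when the Combiner is called, how many times it is called, or whether it is called at all. (The Partitioner, by contrast, is applied to every map output record.)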
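One consequence of this uncertainty is that a job must produce the same result however often the Combiner runs. The sketch below illustrates that for word counting in plain Java. The class and method names are hypothetical and are not Hadoop APIs; the combiner is assumed to use the same summing logic as the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration; these names are hypothetical, not Hadoop APIs.
class CombinerRunsSketch {
    // Combiner and reducer share the same logic here: sum the counts per word.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // Applies the combiner `times` times to the map output, then reduces.
    static Map<String, Integer> job(List<Map.Entry<String, Integer>> mapOutput, int times) {
        List<Map.Entry<String, Integer>> pairs = mapOutput;
        for (int i = 0; i < times; i++) {
            pairs = new ArrayList<>(sum(pairs).entrySet());
        }
        return sum(pairs);
    }
}
```

Calling job with times set to 0, 1, or 2 on the same map output yields the same totals. What changes is only how many intermediate pairs exist along the way, and that is exactly the part the framework does not let you control.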

This uncertainty makes it hard to write efficient Hadoop programs. Is there a way around it? Yes.

Analysis

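Hadoop provides a mechanism for preserving state in Mappers and Reducers. The state of a Mapper instance (the same applies to a Reducer instance) can be kept in a State variable. The Configure stage is comparable to a constructor in C++, and the Close stage to a destructor. The State variable should be a member of the Mapper that is initialized in the Configure stage. (If it were created inside the Map stage instead, a new object would be allocated for every input record, and such state could not be carried from one record to the next.) With state preserved across records, the Mapper can aggregate counts itself and emit them only once, in the Close stage.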
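The lifecycle described above can be sketched in plain Java without Hadoop on the classpath. This is only an illustration of the Configure, Map, and Close stages: LifecycleSketch, its emitted list, and run are hypothetical names, and a java.util.HashMap stands in for whatever State variable a real Mapper would use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for a Mapper's lifecycle; none of these names are Hadoop APIs.
class LifecycleSketch {
    // State shared by every map() call of this mapper instance.
    private Map<String, Integer> state;
    // Collects what the mapper would emit (stands in for Context.write).
    final List<String> emitted = new ArrayList<>();

    // Configure stage: runs once, before the first record.
    void setup() {
        state = new HashMap<>();
    }

    // Map stage: runs once per input record; it only updates the state.
    void map(String line) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            state.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    // Close stage: runs once, after the last record; emits one pair per distinct key.
    void cleanup() {
        for (Map.Entry<String, Integer> e : state.entrySet()) {
            emitted.add(e.getKey() + "\t" + e.getValue());
        }
    }

    // Drives the three stages in the order the framework would.
    static List<String> run(List<String> lines) {
        LifecycleSketch m = new LifecycleSketch();
        m.setup();
        for (String line : lines) {
            m.map(line);
        }
        m.cleanup();
        return m.emitted;
    }
}
```

For the two input lines "a b a" and "b a c", run emits three pairs, one per distinct word, instead of the six a mapper would emit if it wrote (word, 1) for every token.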

Implementation

The setup method of org.apache.hadoop.mapreduce.Mapper plays the role of Configure, and the cleanup method plays the role of Close. We only need to override them.

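// These imports belong at the top of the enclosing job class's file.
// The Cloud9 and NxParser package names are assumed from the releases of that period.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.semanticweb.yars.nx.Node;
import org.semanticweb.yars.nx.parser.NxParser;

import edu.umd.cloud9.util.map.HMapKI;

public static class TokenizerMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    // State shared by all map() calls of this Mapper instance.
    private HMapKI<String> map = null;

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        // Configure stage: create the state once per Mapper instance.
        map = new HMapKI<String>();
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        InputStream is = new ByteArrayInputStream(value.toString().getBytes());
        NxParser parser = new NxParser(is);
        if (parser.hasNext()) {
            // The original listing used nodes without defining it; it is
            // assumed here to be the statement returned by the parser.
            Node[] nodes = parser.next();
            StringTokenizer itr = new StringTokenizer(nodes[2].toString());
            while (itr.hasMoreTokens()) {
                String t = itr.nextToken();
                // Count locally instead of emitting (t, 1) for every token.
                map.increment(t);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // Close stage: emit one (word, count) pair per distinct word.
        for (String word : map.keySet()) {
            context.write(new Text(word), new IntWritable(map.get(word)));
        }
    }
}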

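A note on the HMapKI class used above: it comes from Cloud9, written by Jimmy Lin (http://www.umiacs.umd.edu/~jimmylin/projects/index.html). The package wraps many data structures commonly needed in MapReduce programs and is very convenient. I also strongly recommend Jimmy Lin's book Data-Intensive Text Processing with MapReduce, which is where this algorithm comes from.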
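For readers who do not have Cloud9 at hand, the three operations the mapper relies on (increment, get, keySet) can be approximated with java.util.HashMap. The class below is a hypothetical stand-in, not Cloud9's implementation; in particular, returning 0 for an unseen key is a choice made for this sketch, and the real HMapKI may behave differently there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the small part of HMapKI<K> used by the mapper.
// It is not Cloud9's implementation and boxes its counts as Integer.
class CountingMapSketch<K> {
    private final Map<K, Integer> counts = new HashMap<>();

    // Adds 1 to the count for key, starting from 1 if the key is new.
    void increment(K key) {
        counts.merge(key, 1, Integer::sum);
    }

    // Returns the current count; 0 for an unseen key (a choice of this sketch).
    int get(K key) {
        return counts.getOrDefault(key, 0);
    }

    // Returns the distinct keys seen so far.
    Set<K> keySet() {
        return counts.keySet();
    }
}
```

Swapping this in for HMapKI would keep the mapper's logic unchanged, at the cost of boxing every count, which is presumably one reason a dedicated object-to-int map is attractive for large vocabularies.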

posted on 2012-12-16 01:45 by DingaGa
