Writing MapReduce Programs on Hadoop: Context

https://www.oschina.net/translate/writing-hadoop-map-reduce-program-in-java?print

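MapReduce consists of two phases: the Map phase and the Reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output, and the programmer chooses the key and value types.

The data flow through Map and Reduce looks like this: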

Input ==> Map ==> Mapper Output ==> Sort and shuffle ==> Reduce ==> Final Output
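The flow above can be imitated in plain Java without any Hadoop classes. The sketch below is only an illustration under that assumption: the class WordCountFlow and its methods map, shuffle, reduce, and run are hypothetical names, and everything runs in one process rather than on a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the MapReduce data flow; no Hadoop classes are involved.
class WordCountFlow {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Sort and shuffle: group all values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle, and reduce over the given lines and returns the final output.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        for (String line : lines) {
            mapperOutput.addAll(map(line));
        }
        TreeMap<String, Integer> finalOutput = new TreeMap<>();
        shuffle(mapperOutput).forEach((key, values) -> finalOutput.put(key, reduce(key, values)));
        return finalOutput;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello world")));
    }
}
```

On a real cluster the three stages run on different machines and the shuffle moves data over the network; the sketch only shows how the key-value pairs change shape between stages.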

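Writing Hadoop MapReduce code in Java

A MapReduce program needs three pieces: a Map, a Reduce, and the code that runs the job (called the Invoker here).

1) Create a Map class (any name will do) with a map function. map is declared in org.apache.hadoop.mapreduce.Mapper. It is not abstract: the default implementation passes each record through unchanged, so the class below overrides it.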

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Every key is emitted with a count of 1.
    private static final IntWritable one = new IntWritable(1);
    // Reused across calls to avoid allocating a new Text per record.
    private final Text word = new Text();

    // Called once per input record; value holds the text of one input line.
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The whole line becomes the output key, so the input should hold one word per line.
        word.set(value.toString());
        context.write(word, one);
    }
}

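Explanation:

Mapper is a generic class with four type parameters (input key, input value, output key, output value). Here the input key is LongWritable (Hadoop's counterpart of Long), the input value is Text (Hadoop's counterpart of String), the output key is Text (the keyword), and the output value is IntWritable (Hadoop's counterpart of int). With TextInputFormat, the input key is the byte offset of the line within the file and the input value is the line itself. These Hadoop types behave much like the corresponding Java types, except that they are optimized for network serialization.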
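To give a rough sense of what "optimized for serialization" means, the sketch below compares two encodings of the same int using only the JDK. It is an illustration, not Hadoop code: CompactIntDemo and its methods are hypothetical names. The compact method writes the value with DataOutputStream.writeInt, which is the same style of encoding IntWritable uses, while javaSerialized uses standard java.io object serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Compares a raw 4-byte int encoding with standard Java object serialization.
class CompactIntDemo {

    // Writes the int as 4 raw bytes, with no type information attached.
    static byte[] compact(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Serializes a boxed Integer with java.io serialization, which also writes class metadata.
    static byte[] javaSerialized(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("compact bytes: " + compact(1).length);
        System.out.println("java.io bytes: " + javaSerialized(1).length);
    }
}
```

The compact form is always 4 bytes; the java.io form is considerably larger because it carries a stream header and a class description. That overhead matters when billions of small records cross the network during the shuffle.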

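2) Create a Reducer class (any name will do) with a reduce function. reduce is declared in org.apache.hadoop.mapreduce.Reducer. It is not abstract either: the default implementation writes every value out with its key unchanged, so the class below overrides it.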

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per distinct key with all the values the mappers emitted for it.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        // Emit the key together with its total count.
        context.write(key, new IntWritable(sum));
    }
}

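Explanation:

Reducer is a generic class with four type parameters (input key, input value, output key, output value). The input key and input value types must match the Mapper's output types. The output key is Text (the keyword) and the output value is IntWritable (the number of occurrences).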
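Java generics cannot enforce this agreement across two separate classes, so a mismatch usually shows up only when the job runs. The sketch below illustrates the idea of such a run-time check. It is a simplified stand-in, not Hadoop's implementation: TypedCollector is a hypothetical class, and String and Integer stand in for Text and IntWritable.

```java
// Hypothetical stand-in for an output collector: it remembers the declared
// key and value classes and rejects any pair whose runtime types differ.
class TypedCollector {
    private final Class<?> keyClass;
    private final Class<?> valueClass;
    private int accepted = 0;

    TypedCollector(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    // Accepts the pair only if both runtime classes equal the declared ones.
    void write(Object key, Object value) {
        if (key.getClass() != keyClass) {
            throw new IllegalStateException("Type mismatch in key: expected "
                    + keyClass.getName() + ", received " + key.getClass().getName());
        }
        if (value.getClass() != valueClass) {
            throw new IllegalStateException("Type mismatch in value: expected "
                    + valueClass.getName() + ", received " + value.getClass().getName());
        }
        accepted++;
    }

    // Number of pairs that passed the check.
    int accepted() {
        return accepted;
    }

    public static void main(String[] args) {
        TypedCollector matching = new TypedCollector(String.class, Integer.class);
        matching.write("hello", 1); // declared types agree with the emitted pair

        TypedCollector mismatched = new TypedCollector(String.class, String.class);
        try {
            mismatched.write("hello", 1); // value is an Integer, but String was declared
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The practical takeaway is that the type parameters on Mapper and Reducer and the classes declared on the job should all be checked against each other by hand.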

3) With the Map and Reduce implementations ready, we now need an invoker that configures the Hadoop job and launches the MapReduce program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Addresses of the HDFS NameNode and the JobTracker; adjust them to your cluster.
        configuration.set("fs.default.name", "hdfs://localhost:10011");
        configuration.set("mapred.job.tracker", "localhost:10012");

        // Newer Hadoop releases deprecate this constructor in favor of Job.getInstance(configuration, "Word Count").
        Job job = new Job(configuration, "Word Count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // The declared output types must match what Map and Reduce emit: Text keys and IntWritable values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
        job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);
        // args[0] is the input directory; args[1] is the output directory, which must not exist yet.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4) Compile the code:

mkdir WordCount
javac -classpath ${HADOOP_HOME}/hadoop-0.20.2+228-core.jar -d WordCount path/*.java

5) Create the JAR:

jar -cvf ~/WordCount.jar -C WordCount/ .

6) Create the input files on the local file system

For example:

mkdir /home/user1/wordcount/input
cd /home/user1/wordcount/input
gedit file01
gedit file02
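On a machine without a graphical editor, the same two files can be produced from a short Java program. This is a sketch under assumptions: MakeInputFiles and writeWords are hypothetical names, the sample words are invented for illustration, and the files hold one word per line because the Map class above uses each whole line as its key.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Writes small sample input files for the word count job.
class MakeInputFiles {

    // Creates dir (and any missing parents) and writes one word per line into fileName.
    static Path writeWords(Path dir, String fileName, List<String> words) {
        try {
            Files.createDirectories(dir);
            return Files.write(dir.resolve(fileName), words, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The target directory is taken from the first argument; the default is an assumption.
        Path dir = Paths.get(args.length > 0 ? args[0] : "wordcount/input");
        writeWords(dir, "file01", Arrays.asList("hello", "world", "bye", "world"));
        writeWords(dir, "file02", Arrays.asList("hello", "hadoop", "goodbye", "hadoop"));
        System.out.println("Wrote sample input to " + dir.toAbsolutePath());
    }
}
```

Run it with the input directory as the only argument, for example java MakeInputFiles /home/user1/wordcount/input.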

7) Copy the local input files to HDFS. Use -put, which reads its source from the local file system:

$HADOOP_HOME/bin/hadoop fs -put ~/wordcount/input/file01 /home/user1/dfs/input/file01
$HADOOP_HOME/bin/hadoop fs -put ~/wordcount/input/file02 /home/user1/dfs/input/file02

8) Run the JAR:

$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount /home/user1/dfs/input /home/user1/dfs/output

9) Once the job has finished, list the reducer output files with:

$HADOOP_HOME/bin/hadoop fs -ls /home/user1/dfs/output/

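10) View the contents of the output files with the commands below. Jobs written against the org.apache.hadoop.mapreduce API name reducer output part-r-NNNNN, one file per reducer, so use the file names listed in step 9:

$HADOOP_HOME/bin/hadoop fs -cat hdfs:///home/user1/dfs/output/part-r-00000
$HADOOP_HOME/bin/hadoop fs -cat hdfs:///home/user1/dfs/output/part-r-00001
$HADOOP_HOME/bin/hadoop fs -cat hdfs:///home/user1/dfs/output/part-r-00002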
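TextOutputFormat writes each record as the key, a tab character, and the value on one line. If the output needs further processing, a small parser is enough. The sketch below assumes that default tab separator; OutputLineParser is a hypothetical helper, and the sample lines in main are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses word count output lines of the form "word<TAB>count".
class OutputLineParser {

    // Returns a word -> count map in the order the lines were read.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // The count follows the last tab, in case the key itself contains one.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, hello=2}
        System.out.println(parse(Arrays.asList("hadoop\t2", "hello\t2")));
    }
}
```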

Next article: Using the Distributed Cache in Java Hadoop MapReduce

Translation URL: https://www.oschina.net/translate/writing-hadoop-map-reduce-program-in-java

Original article: http://randomzone.in/2013/02/14/writing-hadoop-map-reduce-program-in-java/

Posted on 2017-05-31 17:33 by 小西红柿
