wordcount代码示例

一、wordCount

　　基类Mapper类和Reducer类中都是只包含四个方法：setup方法，cleanup方法，run方法，map方法。在run方法中调用了上面的三个方法：setup方法，map方法，cleanup方法。其中setup方法和cleanup方法默认是不做任何操作，且它们只被执行一次。

　　setup()：此方法被MapReduce框架仅且执行一次，在执行Map任务前，进行相关变量或者资源的集中初始化工作。若是将资源初始化工作放在方法map()中，导致Mapper任务在解析每一行输入时都会进行资源初始化工作，导致重复，程序运行效率不高！

　　cleanup(),此方法被MapReduce框架仅且执行一次，在执行完毕Map任务后，进行相关变量或资源的释放工作。若是将释放资源工作放入方法map()中，也会导致Mapper任务在解析、处理每一行文本后释放资源，而且在下一行文本解析前还要重复初始化，导致反复重复，程序运行效率不高！

（1）map

package cn.itcast.mapreduce;
 
import java.io.IOException;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 *
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * KEYIN：是指框架读取到的数据的key类型
 *         在默认的读取数据组件InputFormat下，读取的key是一行文本的偏移量，所以key的类型是long类型的
 * 
 * VALUEIN指框架读取到的数据的value类型
 *         在默认的读取数据组件InputFormat下，读到的value就是一行文本的内容，所以value的类型是String类型的
 * 
 * keyout是指用户自定义逻辑方法返回的数据中key的类型 这个是由用户业务逻辑决定的。
 *         在我们的单词统计当中，我们输出的是单词作为key，所以类型是String
 * 
 * VALUEOUT是指用户自定义逻辑方法返回的数据中value的类型 这个是由用户业务逻辑决定的。
 *         在我们的单词统计当中，我们输出的是单词数量作为value，所以类型是Integer
 * 但是，String ,Long都是jdk中自带的数据类型，在序列化的时候，效率比较低。hadoop为了提高序列化的效率，他就自己自定义了一套数据结构。
 * 所以说在我们的hadoop程序中，如果该数据需要进行序列化（写磁盘，或者网络传输），就一定要用实现了hadoop序列化框架的数据类型
 * 
 * Long------->LongWritable
 * String----->Text
 * Integer---->IntWritable
 * null------->nullWritable
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    /**
     * 这个map方法就是mapreduce程序中被主体程序MapTask所调用的用户业务逻辑方法
     * Maptask会驱动我们的读取数据组件InputFormat去读取数据（KEYIN，VALUEIN）,每读取一个（k,v），他就会传入到这个用户写的map方法中去调用一次
     * 在默认的inputFormat实现中，此处的key就是一行的起始偏移量，value就是一行的内容
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //获取每一行的文本内容
        String lines = value.toString();
        String[] words = lines.split(" ");
        for (String word :words) {
　　　　　　　//处理后的数据写入context，使用方法context.write(key, value);，这里的key和value会传递给下一个过程
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

（2）reduce

package cn.itcast.mapreduce;
 
import java.io.IOException;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
 
/***
 *  reducetask在调用我们的reduce方法
 *  reducetask应该接收到map阶段（前一阶段）中所有maptask输出的数据中的一部分；
 *  （key.hashcode% numReduceTask==本ReduceTask编号）
 *  reducetask将接收到的kv数据拿来处理时，是这样调用我们的reduce方法的：  
 *  先讲自己接收到的所有的kv对按照k分组（根据k是否相同）
 *  然后将一组kv中的k传给我们的reduce方法的key变量，把这一组kv中的所有的v用一个迭代器传给reduce方法的变量values
 * 
 */
 
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int count =0;
        for(IntWritable v :values){
            count += v.get();
        }
        context.write(key, new IntWritable(count));
    }
}

（3）driver

package cn.itcast.mapreduce;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 
/**
 * 本类是客户端用来指定wordcount job程序运行时候所需要的很多参数
 * 
 * 比如：指定哪个类作为map阶段的业务逻辑类  哪个类作为reduce阶段的业务逻辑类
 *         指定用哪个组件作为数据的读取组件  数据结果输出组件
 *         指定这个wordcount jar包所在的路径
 *         以及其他各种所需要的参数
 */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mini1:9000");
//        conf.set("mapreduce.framework.name", "yarn");
//        conf.set("yarn.resourcemanager.hostname", "mini1");
        Job job = Job.getInstance(conf);
        //$JAVA_HOMEbin -cp hdfs-2.3.4.jar:mapreduce-2.0.6.4.jar;        
        //告诉框架，我们的的程序所在jar包的位置
//        job.setJar("/root/wordcount.jar");
        job.setJarByClass(WordCountDriver.class);
        
        //告诉程序，我们的程序所用的mapper类和reducer类是什么
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        //告诉框架，我们程序输出的数据类型
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        //告诉框架，我们程序使用的数据读取组件 结果输出所用的组件是什么
        //TextInputFormat是mapreduce程序中内置的一种读取数据组件  准确的说 叫做 读取文本文件的输入组件
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        //告诉框架，我们要处理的数据文件在那个路劲下
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        //告诉框架，我们的处理结果要输出到什么地方
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
        boolean res = job.waitForCompletion(true);
        System.exit(res?0:1);
        
    }
}

posted on 2020-06-27 12:14 hdc520 阅读(526) 评论(0) 收藏举报

刷新页面返回顶部

hdc520

wordcount代码示例

导航

公告