MapReduce Principles: The Shuffle Mechanism

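The data processing that takes place after the map method and before the reduce method is called Shuffle.

Each record emitted by the map method is assigned a partition and then written into a circular buffer, which stores index (metadata) entries on one side and the record data on the other. Once the buffer is 80% full, its contents are spilled to disk. Before each spill, the records are quick-sorted by key in dictionary order within each partition; the sort works on the index entries rather than moving the data itself. The spill files are then combined with a merge sort, partition by partition. After the merge, the data can also be compressed, which reduces the amount written to disk.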
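The ordering produced before a spill can be illustrated with a small sketch in plain Java, without Hadoop. It only shows the sort order (partition first, then key); the class name SpillSortSketch, the Entry type, and the sample records are hypothetical, and String.compareTo stands in for Hadoop's key comparison, which is an assumption that holds for simple ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;

class SpillSortSketch {
    // One buffered map output record: its assigned partition and its key.
    static final class Entry {
        final int partition;
        final String key;

        Entry(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    // Order the records the way a spill lays them out:
    // by partition number first, then by key in dictionary order.
    static List<Entry> sortForSpill(List<Entry> buffered) {
        List<Entry> sorted = new ArrayList<>(buffered);
        sorted.sort((a, b) -> a.partition != b.partition
                ? Integer.compare(a.partition, b.partition)
                : a.key.compareTo(b.key));
        return sorted;
    }

    // Sorts four sample records and renders them as "partition:key" pairs.
    static String demoOrder() {
        List<Entry> buffered = new ArrayList<>();
        buffered.add(new Entry(1, "b"));
        buffered.add(new Entry(0, "z"));
        buffered.add(new Entry(1, "a"));
        buffered.add(new Entry(0, "c"));

        StringBuilder out = new StringBuilder();
        for (Entry e : sortForSpill(buffered)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(e.partition).append(':').append(e.key);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demoOrder()); // prints "0:c 0:z 1:a 1:b"
    }
}
```

Because every spill file is already ordered this way, the later merge only has to interleave sorted runs for each partition rather than sort everything again.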

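Partitioning

The requirement is to write the results to different files (partitions) according to some condition. For example, phone numbers are written to different files (partitions) according to the province they belong to.

Source code walkthrough

　　Take WordCount as an example.

Add the following line to the driver: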

instance.setNumReduceTasks(2);

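Set a breakpoint on the context.write() call in the mapper.

Step into the innermost write() method. The collector there is the circular buffer; next, step into the method call in its argument list.

This leads to getPartition(), the method that computes the partition: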

public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

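This is the default partitioner (HashPartitioner). The partition is the key's hashCode modulo the number of ReduceTasks, so the user cannot control which partition a given key ends up in.

Steps for a custom Partitioner:

　　Define a class that extends Partitioner and override its getPartition() method.

　　Register the custom partitioner in the job driver.

　　Set the number of ReduceTasks.

Custom partitioning example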
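To see what the expression in getPartition() does, here is a small stand-alone sketch in plain Java. The class name HashPartitionSketch and the sample keys are hypothetical, and ordinary Java objects stand in for Hadoop key types. Hadoop's Text computes its hash code differently from String, so the partition numbers below are not what a real job would assign to the same words.

```java
class HashPartitionSketch {
    // Same arithmetic as the getPartition() shown above, applied to any Object key.
    static int partitionFor(Object key, int numReduceTasks) {
        // The mask clears the sign bit, so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;

        System.out.println(partitionFor("a", numReduceTasks));                 // "a".hashCode() is 97, prints 1
        System.out.println(partitionFor("b", numReduceTasks));                 // "b".hashCode() is 98, prints 0
        System.out.println(partitionFor(Integer.valueOf(-7), numReduceTasks)); // prints 1

        // Without the mask, a negative hash code gives a negative remainder,
        // which is not a valid partition number.
        System.out.println(-7 % numReduceTasks);                               // prints -1
    }
}
```

The result depends only on the hash code and the number of ReduceTasks, which is why related keys cannot be steered to a chosen partition without a custom Partitioner.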

package com.rsh.mapreduce.partitioner2;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to a partition based on the first three digits of the phone number.
public class ProvincePartitioner extends Partitioner<Text,FlowBean>{
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {

        int partition;
        // The key is the phone number; take its three-digit prefix
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);

        // Prefixes 136 to 139 get partitions 0 to 3; everything else goes to partition 4
        if("136".equals(prePhone)){
            partition = 0;
        } else if ("137".equals(prePhone)) {
            partition = 1;
        }else if ("138".equals(prePhone)) {
            partition = 2;
        }else if ("139".equals(prePhone)) {
            partition = 3;
        }else {
            partition = 4;
        }

        return partition;
    }

}

package com.rsh.mapreduce.partitioner2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Get the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // Associate this driver class
        job.setJarByClass(FlowDriver.class);

        // Associate the Mapper and Reducer classes
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Use the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);

        // One ReduceTask per partition returned by ProvincePartitioner (0 to 4)
        job.setNumReduceTasks(5);

        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path("D:\\hadoopMR\\MRInput\\flow.txt"));
        FileOutputFormat.setOutputPath(job,new Path("D:\\hadoopMR\\MROutput5"));

        // Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
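ProvincePartitioner returns partition numbers 0 to 4, which is why the driver asks for 5 ReduceTasks. The following plain-Java sketch shows how the two settings relate. The class PartitionRangeSketch and its methods are hypothetical stand-ins that do not use Hadoop; the range check imitates the validation the map side is understood to perform, and is not Hadoop's own code.

```java
class PartitionRangeSketch {
    // Stand-in for ProvincePartitioner: same prefix rules, with a plain String key.
    static int provincePartition(String phone) {
        if (phone.startsWith("136")) return 0;
        if (phone.startsWith("137")) return 1;
        if (phone.startsWith("138")) return 2;
        if (phone.startsWith("139")) return 3;
        return 4;
    }

    // A partition number is usable only if it lies in 0 .. numReduceTasks - 1.
    static boolean isValid(int partition, int numReduceTasks) {
        return partition >= 0 && partition < numReduceTasks;
    }

    public static void main(String[] args) {
        int partition = provincePartition("13512345678"); // no matching prefix, so partition 4
        System.out.println(isValid(partition, 5)); // true: five ReduceTasks cover partitions 0 to 4
        System.out.println(isValid(partition, 3)); // false: partition 4 has no ReduceTask
    }
}
```

As far as I understand Hadoop's behavior, a job configured with more than one but fewer than five ReduceTasks would fail here with an "Illegal partition" IOException, a job with more than five would succeed but leave the extra output files empty, and a job with exactly one ReduceTask would bypass the custom partitioner and write everything to a single file.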

posted @ 2023-02-22 20:16 几人著眼到青衫
