Daily Blog, Nov 25

MapReduce Example: Single-Table Join

Taking this experiment's buyer1(buyer_id, friends_id) table as an example, the principle of a single-table join is as follows. The join connects the buyer_id column of the left table with the friends_id column of the right table, where the left and right tables are the same physical table. So in the map stage, after an input record is split into buyer_id and friends_id, the mapper first emits buyer_id as the key and friends_id as the value; this output acts as the left table. It then emits the same pair again with friends_id as the key and buyer_id as the value, acting as the right table. To tell the two tables apart in the output, table information is added to the value, for example the character 1 at the very start of the value String marks the left table and the character 2 marks the right table. The map output thus contains both a left table and a right table, and the shuffle stage completes the join by grouping records with the same key. When reduce receives the result, each key's value list holds both the forward (buyer_id, friends_id) and the reversed (friends_id, buyer_id) relations. The reducer parses each key's value list, collects the left-table values into one array and the right-table values into another, and the Cartesian product of the two arrays is the final result.
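As a concrete illustration, suppose buyer1 contained the two rows below (sample values invented for this walkthrough, not taken from the real experiment data):

    10001	10002
    10002	10003

User 10001 is linked to 10002, and 10002 to 10003. After the map tagging, shuffle grouping on the shared value 10002, and the Cartesian product in reduce, the job emits the pair (10003, 10001), connecting the two users who both touch 10002. The map and reduce steps of this trace are shown in detail after the next two paragraphs.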

Map processes a plain text file. The data the Mapper sees is produced by the InputFormat, which slices the dataset into small InputSplits; a RecordReader then parses each split into <key, value> pairs that are handed to the map function. Inside map, split("\t") cuts each line at the tab character and stores the fields in the array arr[]; arr[0] is assigned to mapkey and arr[1] to mapvalue. Two context.write() calls output the data twice, and the marker relationtype, set to 1 for the first output and 2 for the second, tags the value of each copy.
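For example, given the hypothetical input line "10001\t10002" from the walkthrough above, the two write() calls emit:

    (10001, "1+10002")    left table: key = buyer_id
    (10002, "2+10001")    right table: key = friends_id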

By the time the reduce side receives the map output, all values sharing the same key have already been collected into the values iterable. In the reduce function, two arrays, buyer[] and friends[], are first created to hold the two kinds of tagged map output. A while loop driven by the iterator's hasNext() and next() methods walks through values, assigning each element to record. charAt(0) reads the first character of record into relationtype. If relationtype is 1, substring(2) cuts record from index 2 onward and stores the result in buyer[]; if relationtype is 2, the cut value is stored in friends[]. Finally, two nested for loops emit <key, value> pairs with key = buyer[m] and value = friends[n], as traced below.
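Continuing the hypothetical rows from above, the reducer invoked for key 10002 receives the value list ["1+10003", "2+10001"]: the tag-1 value fills buyer[] with "10003", the tag-2 value fills friends[] with "10001", and the nested loops write the single pair (10003, 10001).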

The full code is as follows:

package exper;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DanJoin {
    public static class Map extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line has the form "buyer_id<TAB>friends_id".
            String line = value.toString();
            String[] arr = line.split("\t");
            String mapkey = arr[0];
            String mapvalue = arr[1];
            // Left table: key = buyer_id, value = friends_id tagged with "1".
            String relationtype = "1";
            context.write(new Text(mapkey), new Text(relationtype + "+" + mapvalue));
            // Right table: key = friends_id, value = buyer_id tagged with "2".
            relationtype = "2";
            context.write(new Text(mapvalue), new Text(relationtype + "+" + mapkey));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int buyernum = 0;
            String[] buyer = new String[20];    // left-table values (tag '1')
            int friendsnum = 0;
            String[] friends = new String[20];  // right-table values (tag '2')
            Iterator<Text> ite = values.iterator();
            while (ite.hasNext()) {
                String record = ite.next().toString();
                if (record.isEmpty()) {
                    continue;
                }
                // The first character is the table tag; the payload starts at
                // index 2, skipping the tag and the '+' separator.
                char relationtype = record.charAt(0);
                if ('1' == relationtype) {
                    buyer[buyernum] = record.substring(2);
                    buyernum++;
                }
                if ('2' == relationtype) {
                    friends[friendsnum] = record.substring(2);
                    friendsnum++;
                }
            }
            // The Cartesian product of the two arrays is the join result.
            if (0 != buyernum && 0 != friendsnum) {
                for (int m = 0; m < buyernum; m++) {
                    for (int n = 0; n < friendsnum; n++) {
                        // Compare Strings with equals(), not !=.
                        if (!buyer[m].equals(friends[n])) {
                            context.write(new Text(buyer[m]), new Text(friends[n]));
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Local paths used in this experiment; adjust them to your environment.
        String InPath = "D:\\mapreduce\\4in\\buyer1.txt";
        String OutPath = "file:///D:/mapreduce/4out";
        Job job = Job.getInstance(conf, "Single table join");
        job.setJarByClass(DanJoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(InPath));
        FileOutputFormat.setOutputPath(job, new Path(OutPath));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
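A quick way to try the job, assuming a Hadoop client is installed and the class has been packaged into a jar (the jar name DanJoin.jar below is illustrative):

    hadoop jar DanJoin.jar exper.DanJoin

Because the input and output paths are hardcoded in main(), the job reads buyer1.txt from the local D: drive and writes the joined pairs to D:/mapreduce/4out (as part-r-00000). Note that Hadoop requires the output directory not to exist before the run, otherwise FileOutputFormat throws an exception.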

posted @ 2021-11-25 15:30  软工新人