Implementing Join with MapReduce

MapReduce Join

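Joining two data sets, data1 and data2, on a key is a very common problem. When the data is small, the join can be done entirely in memory.

When the data is large, an in-memory join runs out of memory (OOM). A MapReduce join can handle joins over large data.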

1 Approaches

1.1 reduce join

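In the map phase, emit the join key as the output key and tag each value to mark whether it came from data1 or data2. Because the shuffle phase already groups records by key, the reduce phase only has to check the tag on each value, split the values into two groups, and compute the Cartesian product of the two groups.

This approach has two problems:

1. The map phase does nothing to slim down the data, so every record is sent over the network and sorted during the shuffle, which is slow.

2. The reduce side computes the product of the two sets in memory, which is memory-hungry and can easily cause OOM.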
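
The reduce step described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under assumptions: the class name `ReduceSideJoinSketch`, the `Tagged` holder, and the tag strings "data1" and "data2" are made up for this example, and the `values` list stands in for what the shuffle would deliver for one key.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the reduce step of a reduce-side join (hypothetical names). */
final class ReduceSideJoinSketch {

    /** A value tagged with the data set it came from. */
    static final class Tagged {
        final String source;   // "data1" or "data2" (illustrative tags)
        final String payload;

        Tagged(String source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    /** Joins all values that the shuffle grouped under one key. */
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        // Split the group by source tag. Both lists are held in memory,
        // which is where the OOM risk comes from.
        for (Tagged v : values) {
            if ("data1".equals(v.source)) {
                left.add(v.payload);
            } else {
                right.add(v.payload);
            }
        }
        // Cartesian product of the two sides.
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(key + "\t" + l + "\t" + r);
            }
        }
        return joined;
    }
}
```

With m values from data1 and n values from data2 under the same key, the method emits m × n joined rows, so a single hot key can dominate both memory use and output size.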

1.2 map join

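If one of the two data sets is small, load the small one entirely into memory and index it by the join key. Use the large data file as the map input; each input pair passed to map() can then be joined directly against the small data already in memory. Emit the joined results by key. After the shuffle phase, the reduce side receives data that is already joined and grouped by key.

This approach uses Hadoop's DistributedCache to distribute the small data set to every compute node. Each map node has to load the small data set into memory and index it by the join key.

It has an obvious limitation: one of the data sets must be small enough for the map side to load it into memory and join against it.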

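1.3 Using a memory server to extend a node's memory

For a map join, one data set can be stored on a dedicated in-memory server. In map(), for each <key,value> input pair, fetch the matching data from the memory server by key and perform the join.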
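
One way to picture this is a lookup object that asks the remote store only on a cache miss and keeps a small LRU cache inside the map task. The sketch below is an assumption-laden illustration: `CachedLookup` is a hypothetical name, and the `Function` passed to the constructor stands in for whatever client the memory server actually provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: remote lookups fronted by a small per-task LRU cache (hypothetical class). */
final class CachedLookup<K, V> {
    private final Function<K, V> remote;  // stand-in for the memory-server client
    private final Map<K, V> cache;
    private int remoteCalls = 0;

    CachedLookup(Function<K, V> remote, final int capacity) {
        this.remote = remote;
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the value for key, or null when the remote store has none. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            remoteCalls++;
            value = remote.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Number of lookups that had to go to the remote store. */
    int remoteCalls() {
        return remoteCalls;
    }
}
```

The cache only pays off when the same keys recur close together in the map input; with uniformly random keys, almost every record still costs one network round trip.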

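1.4 Using a BloomFilter to filter out records with no join partner

Build a BloomFilter in memory over the keys of one data set. Before joining a record from the other data set, use the BloomFilter to test whether its key exists. If it does not, the record has no join partner and can be skipped. A positive answer may still be a false positive, so the join itself must handle a missing partner.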
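
To make the idea concrete, here is a minimal Bloom filter sketch in plain Java. It is not Hadoop's own implementation (Hadoop ships one in org.apache.hadoop.util.bloom, which this sketch does not use). The class name `SimpleBloomFilter`, the FNV-style hash, and the sizing parameters are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: false positives are possible, false negatives are not. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records a key by setting hashCount bit positions. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** Returns false only if the key was definitely never added. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int index(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a style hash with a caller-chosen starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

In a join, the filter would be built from the keys of one data set and consulted in map() before emitting a record from the other, so that records whose key is definitely absent never reach the shuffle.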

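1.5 Using the MapReduce package designed for joins

The MapReduce library contains a package designed specifically for joins. I have not studied it yet and do not know how to use it; I am only recording it here as a reminder.

jar: hadoop-mapreduce-client-core.jar

package: org.apache.hadoop.mapreduce.lib.join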

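2 Implementing a map join

Map joins are comparatively more common. The code below implements a map join using DistributedCache.

2.1 Background

There are two data sets: customer data (customer) and order data (orders).

customer

Customer ID	Name	Address	Phone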
1	hanmeimei	ShangHai	110
2	leilei	BeiJing	112
3	lucy	GuangZhou	119

orders

Order ID	Customer ID	Order amount (other fields omitted)
1	1	50
2	1	200
3	3	15
4	3	350
5	3	58
6	1	42
7	1	352
8	2	1135
9	2	400
10	2	2000
11	2	300

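The task is to join customer and orders on customer ID. The result must be grouped by customer ID and sorted by order ID; there are no requirements on the other fields.

Customer ID	Order ID	Order amount	Name	Address	Phone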
1	1	50	hanmeimei	ShangHai	110
1	2	200	hanmeimei	ShangHai	110
1	6	42	hanmeimei	ShangHai	110
1	7	352	hanmeimei	ShangHai	110
2	8	1135	leilei	BeiJing	112
2	9	400	leilei	BeiJing	112
2	10	2000	leilei	BeiJing	112
2	11	300	leilei	BeiJing	112
3	3	15	lucy	GuangZhou	119
3	4	350	lucy	GuangZhou	119
3	5	58	lucy	GuangZhou	119

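1) When submitting the job, distribute the small data set to every node through DistributedCache.
2) On the map side, read the data from DistributedCache and build the lookup map in memory. If a dedicated memory server is used, load the data into the memory server instead, and each map() node can keep just a small local cache. If a BloomFilter is used to speed things up, this is the place to build it.
3) In map(), for each <key,value> pair, look up the key in the map built in step 2), join the matching data, and emit the result.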
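
Before looking at the Hadoop program, the same steps can be simulated in plain Java on the sample data. This is only a sketch under assumptions: `MapJoinSketch` is a hypothetical class, the in-memory sort stands in for the shuffle, and input lines are assumed to be tab-separated and well formed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the map join steps (hypothetical class, no Hadoop). */
final class MapJoinSketch {

    /** Joins tab-separated order lines against tab-separated customer lines. */
    static List<String> join(List<String> customers, List<String> orders) {
        // Step 2: index the small data set by customer ID.
        Map<Integer, String[]> byCustId = new HashMap<>();
        for (String line : customers) {
            String[] c = line.split("\t");  // custId, name, address, phone
            byCustId.put(Integer.parseInt(c[0]), c);
        }

        // Step 3: probe the index for every order; unmatched orders are dropped.
        List<String[]> rows = new ArrayList<>();
        for (String line : orders) {
            String[] o = line.split("\t");  // orderId, custId, amount
            String[] c = byCustId.get(Integer.parseInt(o[1]));
            if (c == null) {
                continue;
            }
            rows.add(new String[] {o[1], o[0], o[2], c[1], c[2], c[3]});
        }

        // Stand-in for the shuffle: sort by (customer ID, order ID).
        rows.sort(Comparator
                .<String[]>comparingInt(r -> Integer.parseInt(r[0]))
                .thenComparingInt(r -> Integer.parseInt(r[1])));

        List<String> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(String.join("\t", r));
        }
        return out;
    }
}
```

The composite sort key (customer ID first, order ID second) is what yields output that is grouped by customer and ordered by order ID within each customer.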

2.2 Implementation

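import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Join extends Configured implements Tool {
    // Location of the customer file on HDFS.
    // TODO: pass this in as an argument
    private static final String CUSTOMER_CACHE_URL = "hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt";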

    /** One row of the customer data. */
    private static class CustomerBean {
        private int custId;
        private String name;
        private String address;
        private String phone;

        public CustomerBean() {}

        public CustomerBean(int custId, String name, String address,
                String phone) {
            super();
            this.custId = custId;
            this.name = name;
            this.address = address;
            this.phone = phone;
        }

        public int getCustId() {
            return custId;
        }

        public String getName() {
            return name;
        }

        public String getAddress() {
            return address;
        }

        public String getPhone() {
            return phone;
        }
    }

    /** Composite map output key: ordered by customer ID, then by order ID. */
    private static class CustOrderMapOutKey implements WritableComparable<CustOrderMapOutKey> {
        private int custId;
        private int orderId;

        public void set(int custId, int orderId) {
            this.custId = custId;
            this.orderId = orderId;
        }

        public int getCustId() {
            return custId;
        }

        public int getOrderId() {
            return orderId;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(custId);
            out.writeInt(orderId);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            custId = in.readInt();
            orderId = in.readInt();
        }

        @Override
        public int compareTo(CustOrderMapOutKey o) {
            int res = Integer.compare(custId, o.custId);
            return res == 0 ? Integer.compare(orderId, o.orderId) : res;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof CustOrderMapOutKey) {
                CustOrderMapOutKey o = (CustOrderMapOutKey) obj;
                return custId == o.custId && orderId == o.orderId;
            } else {
                return false;
            }
        }

        @Override
        public int hashCode() {
            // Hash on custId only, so the default HashPartitioner sends all
            // orders of one customer to the same reducer.
            return custId;
        }

        @Override
        public String toString() {
            return custId + "\t" + orderId;
        }
    }

    private static class JoinMapper extends Mapper<LongWritable, Text, CustOrderMapOutKey, Text> {
        private final CustOrderMapOutKey outputKey = new CustOrderMapOutKey();
        private final Text outputValue = new Text();

        /** Customer data held in memory, keyed by customer ID. */
        private static final Map<Integer, CustomerBean> CUSTOMER_MAP = new HashMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // Format: order ID \t customer ID \t order amount
            String[] cols = value.toString().split("\t");
            if (cols.length < 3) {
                return;
            }

            int custId = Integer.parseInt(cols[1]); // extract the customer ID
            CustomerBean customerBean = CUSTOMER_MAP.get(custId);

            if (customerBean == null) { // no matching customer record to join with
                return;
            }

            StringBuilder sb = new StringBuilder();
            sb.append(cols[2])
                .append("\t")
                .append(customerBean.getName())
                .append("\t")
                .append(customerBean.getAddress())
                .append("\t")
                .append(customerBean.getPhone());

            outputValue.set(sb.toString());
            outputKey.set(custId, Integer.parseInt(cols[0]));

            context.write(outputKey, outputValue);
        }

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Note: this opens the customer file directly by its HDFS URL.
            FileSystem fs = FileSystem.get(URI.create(CUSTOMER_CACHE_URL), context.getConfiguration());

            try (FSDataInputStream fdis = fs.open(new Path(CUSTOMER_CACHE_URL));
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fdis, StandardCharsets.UTF_8))) {
                String line;

                // Format: customer ID \t name \t address \t phone
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split("\t");
                    if (cols.length < 4) { // malformed line, skip it
                        continue;
                    }

                    CustomerBean bean = new CustomerBean(Integer.parseInt(cols[0]), cols[1], cols[2], cols[3]);
                    CUSTOMER_MAP.put(bean.getCustId(), bean);
                }
            }
        }
    }

    /**
     * reduce
     * @author Ivan
     */
    private static class JoinReducer extends Reducer<CustOrderMapOutKey, Text, CustOrderMapOutKey, Text> {
        @Override
        protected void reduce(CustOrderMapOutKey key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Nothing to compute: the join was done on the map side, so just write the records out.
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
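
    /**
     * @param args input path, then output path
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: Join <input path> <output path>");
        }

        // Propagate the job status as the process exit code.
        System.exit(ToolRunner.run(new Configuration(), new Join(), args));
    }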

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, Join.class.getSimpleName());
        job.setJarByClass(Join.class);

        // Register the customer file as a cache file.
        job.addCacheFile(URI.create(CUSTOMER_CACHE_URL));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // map settings
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(CustOrderMapOutKey.class);
        job.setMapOutputValueClass(Text.class);

        // reduce settings
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(CustOrderMapOutKey.class);
        job.setOutputValueClass(Text.class);

        boolean res = job.waitForCompletion(true);

        return res ? 0 : 1;
    }
}

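Runtime environment

Operating system: CentOS 6.4
Hadoop: Apache Hadoop-2.5.0

The location of the customer data file on HDFS is hard-coded as
hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt. Upload the customer data to this location before running the program.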

Program output

See also: Hadoop中两表JOIN的处理方法 (approaches to joining two tables in Hadoop)

posted @ 2016-07-23 12:19 Ivan.Jiang