LEFT SEMI JOIN vs. JOIN ON in Hive

Hive offers several join types, all of which wrap join strategies from MapReduce. Of these, JOIN ON and LEFT SEMI JOIN are among the most representative and the most frequently used.

1. What they have in common

Both are Hive join methods. JOIN ON is a common join (also called a shuffle join or reduce join), while LEFT SEMI JOIN is a variant of the map join (broadcast join). As the names suggest, the two are implemented differently.

2. Differences

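(1) Semi join is a technique borrowed from distributed databases. The motivation: in a reduce-side join, a large volume of data is transferred between machines, and this becomes the bottleneck of the join. If records that cannot take part in the join are filtered out on the map side, network I/O drops sharply and the job runs faster.
The implementation is simple. Pick a small table, say File1, extract its join keys, and save them to a file File3, which is usually small enough to fit in memory. In the map stage, use DistributedCache to copy File3 to every TaskTracker, then drop the records of File2 whose keys are not in File3. The remaining reduce-stage work is the same as in a reduce-side join.
Because Hive did not originally support IN/EXISTS subquery clauses (newer versions add them), such clauses have to be rewritten as LEFT SEMI JOIN. LEFT SEMI JOIN passes only the table's join keys to the map stage; if the key set is small enough the query runs as a map join, otherwise it still runs as a common join. For how the common join (shuffle join/reduce join) works, see the references at the end of this post.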
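The filtering steps described above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not how Hive or MapReduce implements it: `file1`, `file2`, `file3_keys`, and both helper functions are hypothetical names, and the "files" are plain lists of `(key, value)` tuples.

```python
# Hypothetical in-memory stand-ins for the two input files.
file1 = [(1, "a"), (2, "b")]              # small table: (key, value)
file2 = [(1, "x"), (3, "y"), (1, "z")]    # large table: (key, value)

# Step 1: extract only the join keys of the small table ("File3").
file3_keys = {key for key, _ in file1}


def map_side_filter(records, keys):
    """Step 2 (map stage): drop records whose key is not in the key set."""
    return [rec for rec in records if rec[0] in keys]


def reduce_side_join(left, right):
    """Step 3 (reduce stage): ordinary join on the surviving records."""
    by_key = {}
    for key, value in left:
        by_key.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in by_key.get(key, [])]


survivors = map_side_filter(file2, file3_keys)
joined = reduce_side_join(file1, survivors)
```

The record with key 3 never reaches the join step, which is the whole point: only rows that can possibly match are shipped across the network.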

(2) In a LEFT SEMI JOIN, the right-hand table can be filtered only in the ON clause; referencing it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.

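(3) Duplicate keys in the right table are handled differently. LEFT SEMI JOIN has IN (keySet) semantics, so once a left-table row finds a match, any further right-table rows with the same key are skipped. JOIN ON keeps scanning and emits one output row per match.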

As a result, the two differ in both performance and join output.
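A small runnable sketch makes the duplicate-key difference concrete. It uses Python's built-in `sqlite3` module rather than Hive, and because SQLite has no LEFT SEMI JOIN syntax, the semi join is written as the EXISTS subquery it stands in for. The tables `a` and `b` and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k INTEGER, v TEXT);
    CREATE TABLE b (k INTEGER);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (1), (1), (3);   -- key 1 appears twice in b
""")

# Inner join: one output row per matching (a, b) pair.
join_rows = conn.execute(
    "SELECT a.k, a.v FROM a JOIN b ON a.k = b.k ORDER BY a.k"
).fetchall()

# Semi join expressed as EXISTS: each row of a appears at most once.
semi_rows = conn.execute(
    "SELECT a.k, a.v FROM a "
    "WHERE EXISTS (SELECT 1 FROM b WHERE b.k = a.k) ORDER BY a.k"
).fetchall()

print(join_rows)  # two copies of the row with k = 1
print(semi_rows)  # a single copy of the row with k = 1
```

The inner join returns the `k = 1` row twice because `b` holds that key twice, while the EXISTS form returns it once.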

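(4) The SELECT list of a LEFT SEMI JOIN may reference only the left table, because only the right table's join keys take part in the join. With JOIN ON, the full rows of both tables take part by default.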

3. The pitfall with the two joins

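Hive joins are all equi-joins, so there are two ways to write a join that should in theory give the same result. In practice, differences in the data of the joined table can make the results diverge.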

Version 1: left semi join

select
     a.bucket_id
    ,a.search_type
    ,a.level1
    ,a.name1
    ,a.level2
    ,a.name2
    ,cast(a.alipay_fee as double) as zhuliu_alipay
    ,cast(0 as double) as total_alipay
from tmall_data_fdi_search_zhuliu_alipay_cocerage_bucket_1 a
left semi join tmall_data_fdi_dim_main_auc b
on (    a.level2 = b.cat_id2
    and a.brand_id = b.brand_id
    -- filters on the right table must go in the ON clause
    and b.cat_id2 > 0
    and b.brand_id > 0
    and b.max_price = 0
)

Result: 3121 rows

Version 2: join on

select
     a.bucket_id
    ,a.search_type
    ,a.level1
    ,a.name1
    ,a.level2
    ,a.name2
    ,cast(a.alipay_fee as double) as zhuliu_alipay
    ,cast(0 as double) as total_alipay
from tmall_data_fdi_search_zhuliu_alipay_cocerage_bucket_1 a
join tmall_data_fdi_dim_main_auc b
on (    a.level2 = b.cat_id2
    and a.brand_id = b.brand_id)
where b.cat_id2 > 0
  and b.brand_id > 0
  and b.max_price = 0

Result: 3142 rows

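The two versions return different row counts. I had always assumed they were equivalent, yet the numbers disagree.
After digging through the data layer by layer, I found that the joined table (tmall_data_fdi_dim_main_auc) contains duplicate rows. With JOIN ON, a row of A matches both duplicate rows of B and produces two output rows, because the ON condition holds for each;
with LEFT SEMI JOIN, a row of A is returned as soon as one matching row is found in B, and B is not searched any further, so duplicates in B never produce extra output rows.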

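In most cases JOIN ON and LEFT SEMI JOIN are equivalent, but in the situation above duplicate rows make the results differ. It pays to understand how each one works before using it, so you can avoid this pitfall.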
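One way to check for this pitfall is to look for duplicate join keys in the right table, and to join against a de-duplicated key set when semi-join row counts are wanted from a plain join. The sketch below runs in SQLite through Python's `sqlite3` module, not in Hive. The tables `a` and `b` are tiny made-up stand-ins that borrow only the column names from the queries above, and the de-duplicating subquery is my own suggestion rather than something the original queries used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (level2 INTEGER, brand_id INTEGER, alipay_fee REAL);
    CREATE TABLE b (cat_id2 INTEGER, brand_id INTEGER, max_price INTEGER);
    INSERT INTO a VALUES (10, 100, 5.0), (20, 200, 7.5);
    -- (10, 100) is stored twice in b: the kind of duplicate that caused the mismatch
    INSERT INTO b VALUES (10, 100, 0), (10, 100, 0), (20, 200, 0);
""")

# 1. Find join keys that occur more than once in the right table.
duplicates = conn.execute("""
    SELECT cat_id2, brand_id, COUNT(*) AS n
    FROM b
    WHERE max_price = 0
    GROUP BY cat_id2, brand_id
    HAVING COUNT(*) > 1
""").fetchall()

# 2. Plain inner join: duplicates in b inflate the row count.
plain_join = conn.execute("""
    SELECT COUNT(*)
    FROM a JOIN b
      ON a.level2 = b.cat_id2 AND a.brand_id = b.brand_id
    WHERE b.max_price = 0
""").fetchone()[0]

# 3. Join against distinct keys only: same row count as a semi join.
dedup_join = conn.execute("""
    SELECT COUNT(*)
    FROM a JOIN (
        SELECT DISTINCT cat_id2, brand_id FROM b WHERE max_price = 0
    ) d
      ON a.level2 = d.cat_id2 AND a.brand_id = d.brand_id
""").fetchone()[0]
```

With this data the plain join counts 3 rows while the de-duplicated join counts 2, one per row of `a`, mirroring the 3142 vs. 3121 gap on a small scale.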

Other references:

demo1:

What is the difference between natural join and semi join?

The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names.

The result of the semijoin is only the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names.

The point is that natural join is a set of all combinations and semijoin is only the tuples from the first relation not a combination between the two.

R1 (natural join) R2 =

A B C
1 2 3
1 3 4

whereas R1 (semijoin) R2 =

A B
1 2
1 3

So in a way semijoin selects and returns a table of only the tuples from R1 that have an equal attribute with R2
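The two definitions can be tried out directly. The quoted answer does not show R1 and R2 themselves, so the relations below are assumptions chosen to reproduce the two result tables above (with one extra non-matching tuple in R1); `natural_join` and `semijoin` are illustrative helper names, not library functions.

```python
# Assumed inputs: the source shows only the results, not R1 and R2.
r1 = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 5, "B": 9}]
r2 = [{"B": 2, "C": 3}, {"B": 3, "C": 4}]


def matches(t, u):
    """True when tuples t and u agree on every attribute name they share."""
    return all(t[c] == u[c] for c in t.keys() & u.keys())


def natural_join(r, s):
    """All combinations of tuples from r and s that agree on common attributes."""
    return [{**t, **u} for t in r for u in s if matches(t, u)]


def semijoin(r, s):
    """Only the tuples of r that have at least one matching tuple in s."""
    return [t for t in r if any(matches(t, u) for u in s)]


print(natural_join(r1, r2))  # rows carry A, B and C
print(semijoin(r1, r2))      # rows carry only R1's attributes, A and B
```

The tuple `(5, 9)` has no partner in R2, so it drops out of both results; the difference is that the semijoin output keeps only R1's columns and never repeats an R1 tuple.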


Reposted from: https://www.cnblogs.com/wqbin/p/11023008.html

Posted 2021-07-17 17:09 by 温家三哥
