前言:
这几天遇到一个很诡异的问题,一个三级left outer join的句子,在hive0.9和0.14上的执行结果会不一样。
而且在0.14上通过转换右表连接的顺序可以达到正确输出的目的,但是其中是为什么却不得而知,情况非常
诡异,猜测是编译器编译执行计划有问题!(所谓转换右表连接顺序是指把A left outer join B left outer join C
改成A left outer join C left outer join B,出问题的在B子句中有个聚合出的结果在最终结果中表现不对。)
下面详细介绍下问题:
原语句:
select
A.state_date,
A.customer,
A.channel_2,
A.id,
A.pid,
A.type,
A.pv,
A.uv,
A.visits,
if(C.stay_visits
is null,0,C.stay_visits)
as stay_visits,
A.stay_time,
if(B.bounce
is null,0,B.bounce)
as bounce
from
(select
a.state_date,
a.customer,
b.url
as channel_2,
b.id,
b.pid,
b.type,
count(1)
as pv,
count(distinct
a.gid) uv,
count(distinct
a.session_id) as visits,
sum(a.stay_time)
as stay_time
from
(
select state_date,
customer,
gid,
session_id,
ep,
stay_time
from
bdi_fact.mid_pageview_dt0
where
l_date ='$v_date'
)a
join
(select
l_date as state_date ,
url,
id,
pid,
type,
cid
from
bdi_fact.frequency_channel
where
l_date ='$v_date'
and
type ='2'
and
dr='0'
)b
on
a.customer=b.cid
where
a.ep rlike b.url
group
by a.state_date, a.customer, b.url,b.id,b.pid,b.type
)A
left
outer join
(
select
c.state_date
,
c.customer
,
d.url
as channel_2,
d.id,
sum(pagedepth)
as bounce
from
(
select
t1.state_date
,
t1.customer
,
t1.session_id,
t1.ep,
t2.pagedepth
from
(
select
state_date
,
customer
,
session_id,
exit_url
as ep
from
ods.mid_session_enter_exit_dt0
where
l_date ='$v_date'
)t1
join
(
select
state_date
,
customer
,
session_id,
pagedepth
from
ods.mid_session_action_dt0
where
l_date ='$v_date'
and
pagedepth='1'
)t2
on
t1.customer=t2.customer
and
t1.session_id=t2.session_id
)c
join
(select
*
from
bdi_fact.frequency_channel
where
l_date ='$v_date'
and
type ='2'
and
dr='0'
)d
on
c.customer=d.cid
where
c.ep rlike d.url
group
by c.state_date,c.customer,d.url,d.id
)B
on
A.customer=B.customer
and
A.channel_2=B.channel_2
and
A.id=B.id
left
outer join
(
select
e.state_date,
e.customer,
f.url
as channel_2,
f.id,
f.pid,
f.type,
count(distinct
e.session_id) as stay_visits
from
(
select state_date,
customer,
gid,
session_id,
ep,
stay_time
from
bdi_fact.mid_pageview_dt0
where
l_date ='$v_date'
)e
join
(select
l_date as state_date,
url,
id,
pid,
type,
cid
from
bdi_fact.frequency_channel
where
l_date ='$v_date'
and
type ='2'
and
dr='0'
)f
on
e.customer=f.cid
where
e.ep rlike f.url
and
e.stay_time is not null
and
e.stay_time <>'0'
group
by e.state_date, e.customer, f.url,f.id,f.pid,f.type
)C
on
A.customer=C.customer
and
A.channel_2=C.channel_2
and
A.id=C.id
and
A.pid=C.pid
and
A.type=C.type
|
在B子句中算出的结果B.bounce在最终结果中会显示错误(如正确结果是500,但是却显示是100)
但是,通过调整连接顺序后就很正常了。
打印执行计划出来看:
![]()
可以很明显的看见同一个阶段有一个字段没有输出出来,这个阶段就是B子句的任务。而这个字段就是B.bounce。
这应该就能解释为什么最后结果是不对的
好了,既然知道问题在哪儿了,就来看源码吧!
通过查找ExprNodeColumnDesc.java(负责解析并生成输出字段)类,有以下发现:
![]()
(图中箭头所指是我修改后的代码:
原代码是:if (tabAlias != null && dest.tabAlias != null && !tabAlias.equals(dest.tabAlias)){
return false;
})
解析器在判断是否输出字段时,会判断当前字段是否跟最终表的字段相同(在查询中有很多中间表,如多级连接中):
(以下所说的表都是指表别名)
如果中间表和最终表都不为空且当前表不是最终表的话,返回false。即当前字段跟最终字段不同,需要输出。
但是有个漏洞,就是当最终表为null的时候,这个时候如果是原代码的话,会返回true,字段会被直接丢弃
因为返回true的话编译器判断最终结果中已有这个字段不需要输出。
但是想想,如果最终表的别名为null,当前表的字段应该输出才对啊!不然数据的去处没了,源头也会没了。
仔细想想这应该也是一个写代码时粗心犯下的一个错误吧!(在left outer join多级连接中有可能目标表的别名会为空)
通过修改以上代码并编译后在集群上测试,执行计划输出正常了,数据结果也正常了:
图为patch前的结果(最后一列为B.bounce):
![]()
图为patch后的结果:
![]()