第十章 Hive调优 【行列过滤 及谓词下推】
1. 列处理 :
说明 : 避免使用 select *,有分区时,要指定分区
2. 行处理 :
关联原则 : 在关联操作时,能先where的尽量先where,减少数据集
3. hive底层优化策略-谓词下推(predicate pushdown) :
什么是谓词下推 ?
1. 将过滤表达式尽可能移动至靠近数据源的位置,以使真正执行时能直接跳过无关的数据
2. 在map完成对数据的过滤
参数设置 :
-- 是否开启 谓词下推(默认开启)
set hive.optimize.ppd=true;
什么情况下会触发 谓词下推?
-- 触发条件
1. 对于 inner join , full outer join :
1. 条件写在on后面,还是where后面,都会触发 谓词下推
2. 对于 A表 left outer join B表 :
1. A表条件写在on后面 且 B表条件 写在where后面
2. A表条件,B表条件 都写在where后面
3. 对于 A表 right outer join B表 :
1. B表条件写在on后面 且 A表条件 写在where后面
2. A表条件,B表条件 都写在where后面
4. 当 条件中 有 不确定函数时,不会触发
不确定函数 : unix_timestamp、rand
4. (谓词下推)触发条件测试
-- 表结构
-- 表结构
A表 : log_orc B表 : log_parquet
字段名称 字段类型
track_time string
url string
session_id string
referer string
ip string
end_user_id string
city_id string
4.1. inner join (自动添加 key != nul)
1. inner join (自动添加 key != nul)
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where A.ip = '100'
and B.city_id = '10';
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
alias: b
Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((city_id = '10') and url is not null) (type: boolean)
-- 测试2 : AB 条件在on 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
and A.ip = '100'
and B.city_id = '10';
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
alias: b
Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((city_id = '10') and url is not null) (type: boolean)
-- 测试3 : inner join 自动空key过滤
with t1 as (
select 'huawei' as channel ,1 as score
union all
select 'oppo' as channel ,2 as score
union all
select null as channel ,3 as score
),
t2 as (
select 'huawei' as channel ,'2021-08-01' as credit_date
union all
select 'oppo' as channel ,'2021-08-01' as credit_date
union all
select null as channel ,'2021-08-03' as credit_date
union all
select 'iphone' as channel ,'2021-08-03' as credit_date
)
select t1.*,t2.*
from t1
inner join t2
on t1.channel = t2.channel;
-- 结果
channel score channel credit_date
huawei 1 huawei 2021-08-01
oppo 2 oppo 2021-08-01
2. left outer join
2. left outer join
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
where A.ip = '100'
and B.city_id = '10';
TableScan
alias: b
Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((city_id = '10') and url is not null) (type: boolean)
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ip = '100') and url is not null) (type: boolean)
-- 测试2 : AB 条件在on 后面 (B条件触发,A条件不触发)
explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
and B.city_id = '100';
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Select Operator
TableScan
alias: b
Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (city_id = '10') (type: boolean)
-- 测试3 : A 条件在on 后面,B 条件在 where 后面 (都会触发)
explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
where B.city_id = '100';
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ip = '10') and url is not null) (type: boolean)
TableScan
alias: b
Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((city_id = '100') and url is not null) (type: boolean)
3. 不触发条件测试(存在不确定函数)
3. 不触发条件测试(存在不确定函数)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where A.ip = '100'
and A.city_id = rand()
-- and A.city_id = '1'
;
TableScan
alias: a
Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ip = '100') and url is not null) (type: boolean)
参考 :
https://blog.csdn.net/strongyoung88/article/details/81156271
https://blog.csdn.net/baichoufei90/article/details/85264100