第十章 Hive调优 【行列过滤 及谓词下推】

1. 列处理 :
说明 : 避免使用 select *,有分区时,要指定分区

2. 行处理 :
关联原则 : 在关联操作时,能先where的尽量先where,减少数据集
3. hive底层优化策略-谓词下推(predicate pushdown) :
什么是谓词下推 ?
1. 将过滤表达式尽可能移动至靠近数据源的位置,以使真正执行时能直接跳过无关的数据
2. 在map完成对数据的过滤

参数设置 :
-- 是否开启 谓词下推(默认开启)
set hive.optimize.ppd=true;

什么情况下会触发 谓词下推?
-- 触发条件
1. 对于 inner join , full outer join :
1. 条件写在on后面,还是where后面,都会触发 谓词下推
2. 对于 A表 left outer join B表 :
1. A表条件写在on后面 且 B表条件 写在where后面
2. A表条件,B表条件 都写在where后面
3. 对于 A表 right outer join B表 :
1. B表条件写在on后面 且 A表条件 写在where后面
2. A表条件,B表条件 都写在where后面
4. 当 条件中 有 不确定函数时,不会触发
不确定函数 : unix_timestamp、rand

4. (谓词下推)触发条件测试
-- 表结构
-- 表结构
A表 : log_orc B表 : log_parquet

字段名称                 字段类型
track_time              string
url                     string
session_id              string
referer                 string
ip                      string
end_user_id             string
city_id                 string
4.1. inner join (自动添加 key != nul)
1. inner join (自动添加 key != nul)
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and B.city_id = '10';

TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)


-- 测试2 : AB 条件在on 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
and  A.ip = '100'
and B.city_id = '10';

TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)

-- 测试3 : inner join 自动空key过滤
with t1 as (
select 'huawei' as channel ,1 as score
union all
select 'oppo' as channel ,2 as score
union all
select null as channel ,3 as score
),
t2 as (
select 'huawei' as channel ,'2021-08-01' as credit_date
union all
select 'oppo' as channel ,'2021-08-01' as credit_date
union all
select null as channel ,'2021-08-03' as credit_date
union all
select 'iphone' as channel ,'2021-08-03' as credit_date
)
select t1.*,t2.*
from t1
inner join t2
on t1.channel = t2.channel;
-- 结果
channel score   channel credit_date
huawei  1       huawei  2021-08-01
oppo    2       oppo    2021-08-01
2. left outer join
2. left outer join
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and B.city_id = '10';

          TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)

-- 测试2 : AB 条件在on 后面 (B条件触发,A条件不触发)

explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
and B.city_id = '100';

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Select Operator

          TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (city_id = '10') (type: boolean)


-- 测试3 : A 条件在on 后面,B 条件在 where 后面 (都会触发)

explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
where B.city_id = '100';


 TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '10') and url is not null) (type: boolean)

 TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '100') and url is not null) (type: boolean)
3. 不触发条件测试(存在不确定函数)
3. 不触发条件测试(存在不确定函数)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and A.city_id = rand()
-- and A.city_id = '1'
;

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)

 

参考 :
https://blog.csdn.net/strongyoung88/article/details/81156271
https://blog.csdn.net/baichoufei90/article/details/85264100



posted @ 2022-02-11 16:07  学而不思则罔!  阅读(622)  评论(0)    收藏  举报