Hive Tuning

I. Hive Table Design Tuning

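1. Partitioning: most tables are partitioned by date. Put fast-changing data into one partition per day and slow-changing data into one partition per month.
   Bucketing: check whether the data is evenly distributed; if it is not, use a bucketed table to spread it out.
   Note: partitioning and bucketing are mostly decided at design and architecture time (not needed at this stage).
2. At work, external tables are the usual choice, because dropping an external table does not delete the underlying data. Specify a location when creating the table.
3. Choose a suitable file storage format and compression codec.
	Storage formats: textfile, ORCfile. ORC compresses its data itself (ZLIB by default); textfile is stored uncompressed unless a codec is configured.
	Compression codecs: zlib, gzip, snappy
4. Follow a consistent naming convention.
5. Layer the data and separate the tables, but do not split them too finely.
	Data layering and table separation are standard data warehouse design practice.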
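
The points above can be combined in a single table definition. The following is only a sketch: the table name, columns, bucket count and HDFS path are all made up for illustration, and the right partition granularity depends on how fast the data changes.

```sql
-- Hypothetical table: name, columns, bucket count and path are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS dwd_score_detail (
    id        STRING,
    score_id  STRING,
    score     INT
)
PARTITIONED BY (dt STRING)                 -- one partition per day
CLUSTERED BY (id) INTO 8 BUCKETS           -- spread rows evenly by id
STORED AS ORC                              -- columnar format with built-in compression
LOCATION '/warehouse/dwd/dwd_score_detail' -- external tables should name their location
TBLPROPERTIES ('orc.compress' = 'SNAPPY'); -- override ORC's default ZLIB codec

-- Register one day's partition before loading files into it.
ALTER TABLE dwd_score_detail ADD IF NOT EXISTS PARTITION (dt = '2022-02-25');
```

Because the table is external, dropping it removes only the metadata; the files under the location stay in HDFS.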

II. Hive Query Optimization

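1. Partition pruning: filter with where first, then join.
2. Partition and bucket the tables, and merge small files.
	Too many small files in HDFS produce too many map tasks.
3. Use subqueries where appropriate.
mapjoin (automatic conversion through hive.auto.convert.join is on by default since Hive 0.11)
When to use mapjoin: when a large table joins a small table, the small table is broadcast to every map task. By default the small table may be at most about 25 MB (hive.mapjoin.smalltable.filesize).
In a left join, put the large table on the left and the small table on the right.
4. Sorting
order by: global sort, done in a single reducer
sort by: local sort; rows are ordered within each reducer only
distribute by: chooses the partitioning column, so rows with the same key go to the same reducer (worth considering when a computation is grouped by key)
cluster by: equivalent to distribute by plus sort by on the same column, ascending only
			   cluster by Word  is the same as  distribute by Word sort by Word ASC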
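
The list above can be sketched in HiveQL as follows. The property names are real Hive settings, but the values are only examples, and the dt partition column on score is an assumption made for illustration (the original tables are not described as partitioned).

```sql
-- Map join: let Hive broadcast the small side of a join automatically.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- about 25 MB, the default

-- Small files: combine small inputs into fewer splits and merge small outputs.
SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- Filter first, then join: each side is cut down before the join runs.
-- dt is a hypothetical partition column; filtering on it prunes partitions.
SELECT a.id, a.name, b.score
FROM (
    SELECT id, name FROM students
) a
LEFT JOIN (
    SELECT id, score FROM score WHERE dt = '2022-02-25'
) b
ON a.id = b.id;

-- distribute by + sort by: rows with the same id reach the same reducer
-- and are ordered inside that reducer only.
SELECT id, score_id, score
FROM score
DISTRIBUTE BY id
SORT BY id ASC, score DESC;

-- cluster by id is shorthand for: distribute by id sort by id (ascending).
SELECT id, score FROM score CLUSTER BY id;
```

Use distribute by with sort by rather than cluster by whenever the sort column differs from the distribution column or a descending order is needed.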

Subquery example

Using WITH ... AS in Hive

-- The earlier style: nested subqueries
select  t.id
        ,t.name
        ,t.clazz
        ,t.score_id
        ,t.score
        ,c.subject_name
from(
    select  a.id
        ,a.name
        ,a.clazz
        ,b.score_id
        ,b.score
    from (
        select  id
                ,name
                ,clazz
        from
        students
    ) a left join (
    select  id
            ,score_id
            ,score
    from score
    ) b
    on a.id = b.id
) t left join (
    select  subject_id
            ,subject_name 
    from subject
) c on t.score_id = c.subject_id
limit 10;

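-- WITH ... AS pulls the subqueries out of the main query, which makes the logic easier to read and quicker to work with
-- A WITH clause cannot stand alone; it must be followed directly by the SQL statement that uses it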
with tmp1 as (
    select  id
            ,name
            ,clazz
    from students
), tmp2 as ( 
    select  score_id
            ,id
            ,score
    from
    score
), tmp1Jointmp2 as (
    select  a.id
            ,a.name
            ,a.clazz
            ,b.score_id
            ,b.score
    from tmp1 a
    left join tmp2 b
    on a.id = b.id
), tmp3 as (
    select  subject_id
            ,subject_name
    from subject
)
select  t.id
        ,t.name
        ,t.clazz
        ,t.score_id
        ,t.score
        ,c.subject_name
from tmp1Jointmp2 t left join tmp3 c
on t.score_id = c.subject_id
limit 10;

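III. Hive Data Skew Optimization

MR processes all data as <k,v> pairs. If the keys are unevenly distributed, the reducers receive very different amounts of data and the job sits at 99% (or 100%) for a long time. The job monitoring page then shows that only a few reduce subtasks (one or a handful) are still unfinished.

1. Cause of data skew: unevenly distributed data
2. Skew factors: (1) key distribution	(2) shuffle
3. Solutions: (1) optimize at the data source, at the business level
		   (2) find the specific key values that repeat heavily, split them with a hash (salt) value, then sum in two stages
		   shuffle-->Reduce-->compute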

Data skew example

-- Create the table
create table data_skew(
    key string
    ,col string
) row format delimited fields terminated by ',';

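-- Load the data

-- Sample the data at random to find the skewed key (the key value that repeats heavily)
-- Sampling uses distribute by and rand().
-- Between the map output and the reducers there is a shuffle partitioning step, which hash-partitions by key by default.
-- In Hive, distribute by changes the partitioning rule used by the shuffle.
-- A sample has to be random, so rows are distributed by rand().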
select * from data_skew distribute by rand() limit 1000;
-- Sampling shows that the skewed key is '84401'
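
Reading 1000 sampled rows by eye is error-prone, so one option is to count the keys inside the sample. This is a sketch rather than the author's method; the sample size of 1000 and the top-10 cutoff are arbitrary choices.

```sql
-- Count how often each key appears in a random sample of data_skew.
SELECT s.key
      ,count(*) AS cnt
FROM (
    SELECT key
    FROM data_skew
    DISTRIBUTE BY rand()   -- scatter rows randomly across reducers
    SORT BY rand()         -- shuffle the rows inside each reducer
    LIMIT 1000             -- keep a small sample
) s
GROUP BY s.key
ORDER BY cnt DESC          -- the most frequent keys come first
LIMIT 10;
```

A key that dominates the sample is a likely candidate for the skewed key.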

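-- Once the skewed key is known, scatter its rows:
-- select rand();            a random double, at least 0 and less than 1
-- select rand()*6;          a random number below 6 (assuming the hot key is split into 6 parts)
-- select ceil(rand()*6);    rounded up, a random integer from 1 to 6
-- if(key='84401', ceil(rand()*6), key)   salts only the hot key; every other key keeps its own value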
select key
    ,count(*) as cnt
    ,if(key='84401',ceil(rand()*6),key) as hash_key
from data_skew
group by key,if(key='84401',ceil(rand()*6),key);
-- Result (columns: key, cnt, hash_key)
84401	164074	1
84401	163577	2
84401	164007	3
84401	163732	4
84401	163303	5
84401	163435	6
84402	400		84402
84403	200		84403
84404	300		84404
84405	100		84405
null	16872	null
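-- The result shows that the rows for '84401' were split into 6 parts; hash_key values 1 to 6 can be sent to different reducers

-- With the skew removed, group by key again and sum the partial counts
-- Final code for the two-stage sum: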
select t1.key
        ,sum(t1.cnt) as sum_cnt
from(
    select key
            ,count(*) as cnt
            ,if(key='84401',ceil(rand()*6),key) as hash_key
    from data_skew
    group by key,if(key='84401',ceil(rand()*6),key)
) t1 group by t1.key;
-- Result (columns: key, sum_cnt)
84401	982128
84402	400
84403	200
84404	300
84405	100
null	16872

-- A direct count per key:
-- select key,count(*) from data_skew group by key;
-- This does not address the skew; the job can sit at 99% (or 100%) for a long time.

-- To handle the skew on both 84401 and null at the same time:
-- (key is null matches real NULL values; use key = 'null' instead if the column stores the literal string)
select  key
        ,if(key='84401' or key is null,ceil(rand()*6),0) as hash_key
        ,count(*) as cnt
from data_skew
group by key,if(key='84401' or key is null,ceil(rand()*6),0);

select t1.key
      ,sum(t1.cnt) as sum_cnt
from(
    select  key
            ,if(key='84401' or key is null,ceil(rand()*6),0) as hash_key
            ,count(*) as cnt
    from data_skew
    group by key,if(key='84401' or key is null,ceil(rand()*6),0)
) t1
group by t1.key;

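IV. Job Optimization (rarely needed)

Adjusting the number of mappers and reducers
(1) Too many map tasks add startup overhead; control the number of maps by controlling the split size.
(2) The number of reducers is determined from the input data size; plain MR defaults to 1 reducer.
   The number of reducers can also be set by hand:
			set mapred.reduce.tasks = reduce_number
   Cap the number of reducers (to prevent excessive resource use):
			hive.exec.reducers.max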
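
As a rough sketch, the settings mentioned above might be applied per session as shown here. The property names are real, but every value is only an example and would need to be sized for the actual cluster and data volume.

```sql
-- Mapper count: a larger maximum split size means fewer, bigger map tasks.
SET mapreduce.input.fileinputformat.split.maxsize = 256000000;  -- about 256 MB per split

-- Reducer count estimated from input size: roughly one reducer per this many bytes.
SET hive.exec.reducers.bytes.per.reducer = 256000000;

-- Upper bound on the estimated reducer count, to avoid exhausting resources.
SET hive.exec.reducers.max = 100;

-- Or fix the reducer count by hand (overrides the estimate).
SET mapred.reduce.tasks = 10;
```

Fixing the reducer count by hand is usually a last resort; letting Hive estimate it from the input size adapts better as the data grows.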

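Parameter tuning
(1) Map-side partial aggregation, similar to a Combiner (on by default in hive2, no need to change it):
		set hive.map.aggr = true
(2) Load balancing for a skewed group by (off by default; the query then runs as two MR jobs, the first spreading keys randomly across reducers and the second doing the final aggregation):
		hive.groupby.skewindata=true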
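
For comparison with the manual salting in section III, the two parameters can be tried on the same data_skew table. This is a sketch under the assumption that the built-in load balancing is acceptable for the query at hand; it adds an extra MR job, so it is not free.

```sql
-- Pre-aggregate on the map side, like a Combiner.
SET hive.map.aggr = true;

-- Let Hive balance a skewed group by across two jobs.
SET hive.groupby.skewindata = true;

-- The plain aggregation from section III, now run with load balancing enabled.
SELECT key
      ,count(*) AS cnt
FROM data_skew
GROUP BY key;
```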

posted @ 2022-02-25 00:00 阿伟宝座
