Optimization

1. Fetch task

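With the setting below, full-table selects, column selects, and LIMIT queries are answered by a fetch task and do not go through MapReduce.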
set hive.fetch.task.conversion=more;
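
As a rough illustration of the three query shapes above, assuming a hypothetical table emp with columns ename and sal (not from the original notes), queries like these would be expected to run without launching a MapReduce job:

```sql
set hive.fetch.task.conversion=more;

-- full-table select
select * from emp;

-- column select
select ename, sal from emp;

-- limit query
select ename from emp limit 3;
```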

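2. Local mode

For queries on small data sets, the time spent launching the job can be much longer than the time the job actually takes to run. Local mode runs the whole job on a single machine.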
set hive.exec.mode.local.auto=true;

Maximum input size for local MR; the default is 128M
SET hive.exec.mode.local.auto.inputbytes.max=xxx;

Maximum number of input files for local MR; the default is 4
set hive.exec.mode.local.auto.input.files.max=4;
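
Put together, a session might look like the sketch below. The table emp is hypothetical and the two limits are illustrative values only, not recommendations from the original notes:

```sql
set hive.exec.mode.local.auto=true;
-- illustrative: allow local mode for inputs up to 50000000 bytes
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- illustrative: allow local mode for up to 10 input files
set hive.exec.mode.local.auto.input.files.max=10;

-- a small query that may now be run locally instead of on the cluster
select count(*) from emp;
```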

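3. Table optimization

(1) Small table join large table (small table on the left)
Newer versions of Hive have optimized this, so it makes no difference whether the small table is on the left or the right.
(2) Large table join large table
1. Filter out null keys
2. Convert null keys
Sometimes a null key accounts for a lot of rows, but they are not bad data and must appear in the join result. In that case, assign a random value to the null key so that the rows spread evenly across the reducers.
set mapreduce.job.reduces=5;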

select * from nullidtable n
join ori o
on
case when n.id is null then concat('hive', rand()) else n.id end = o.id;
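
For the other approach, filtering out null keys, a minimal sketch on the same two tables could look like this. It only applies when the null-key rows really are unwanted in the result:

```sql
-- drop the null-key rows from the skewed side before the join
select n.*, o.*
from (select * from nullidtable where id is not null) n
join ori o
on n.id = o.id;
```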

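3. MapJoin
If MapJoin is not specified, or the query does not meet the MapJoin conditions, the Hive parser converts the join into a Common Join, i.e. the join is completed on the reduce side, where data skew easily occurs.
set hive.auto.convert.join=true;
Threshold between large and small tables (a table counts as small below 25M by default)
SET hive.mapjoin.smalltable.filesize=xxx;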
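
One way to check whether a join is actually converted is to look at its plan. The sketch below assumes two hypothetical tables, bigtable(id) and smalltable(id, name), with smalltable under the threshold. If the conversion applies, the plan would be expected to show a map join operator rather than a reduce-side join:

```sql
set hive.auto.convert.join=true;

-- inspect the plan instead of running the query
explain
select b.id, s.name
from bigtable b
join smalltable s
on b.id = s.id;
```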

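4. GROUP BY

Not every aggregation has to be done on the reduce side. Many aggregations can be partially done on the map side, with the reduce side producing the final result.
Enable map-side aggregation (default is true)
set hive.map.aggr=true;
Number of rows checked when doing map-side aggregation
set hive.groupby.mapaggr.checkinterval=xxx;
Balance the load when the data is skewed
set hive.groupby.skewindata=true;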
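
A small sketch of how these settings are used together, assuming a hypothetical table emp with a deptno column whose values are unevenly distributed. With skewindata enabled, Hive plans the GROUP BY as two jobs: the first spreads rows randomly over reducers for partial aggregation, the second groups by the real key:

```sql
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- the aggregation itself is written as usual; only the plan changes
select deptno, count(*) as cnt
from emp
group by deptno;
```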

5. count(distinct)

Do a GROUP BY first, then count.
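
A sketch of the rewrite, assuming a hypothetical table bigtable with an id column. The first form sends all the deduplication work to a single reducer. The second lets the GROUP BY spread it over several reducers before the final count:

```sql
-- original form: one reducer deduplicates every id
select count(distinct id) from bigtable;

-- rewritten form: deduplicate with group by, then count the groups
select count(id)
from (select id from bigtable group by id) a;
```

The rewrite costs an extra job, so it tends to pay off only when the data is large.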

6. join

Filter first, then join.
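
As an illustration, assuming hypothetical tables bigtable(id) and ori(id), the second query below filters ori in a subquery so that fewer rows reach the join:

```sql
-- filter written after the join
select o.id
from bigtable b
join ori o on b.id = o.id
where o.id <= 10;

-- filter first in a subquery, then join
select b.id
from bigtable b
join (select id from ori where id <= 10) o
on b.id = o.id;
```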

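7. Parallel execution

Hive converts a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a merge stage, or a limit stage.
By default, Hive executes only one stage at a time.
The stages of a job do not all depend on each other, and independent stages can run in parallel.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;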
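
A query whose stages are independent might look like the sketch below. The tables emp and dept are hypothetical. The two branches of the UNION ALL do not depend on each other, so they are candidates for running at the same time when the cluster has spare resources:

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

select u.src, u.cnt
from (
  -- branch 1: independent aggregation
  select 'emp' as src, count(*) as cnt from emp
  union all
  -- branch 2: independent aggregation
  select 'dept' as src, count(*) as cnt from dept
) u;
```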

8. Strict mode

set hive.mapred.mode=strict;

In strict mode the following risky queries are not allowed:
(1) Cartesian product
(2) No partition being picked up for a query
(3) Comparing bigints and strings
(4) Comparing bigints and doubles
(5) ORDER BY without LIMIT
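
To make a few of these cases concrete, here is a sketch with hypothetical tables: emp(ename, sal), dept, and logs partitioned by dt. The first three queries would be expected to fail in strict mode, and the last two show accepted forms:

```sql
set hive.mapred.mode=strict;

-- (1) cartesian product: join with no join condition
select * from emp e join dept d;

-- (2) partitioned table queried without a partition filter
select * from logs;

-- (5) order by without limit
select * from emp order by sal;

-- accepted: partition filter present
select * from logs where dt = '2020-12-19';

-- accepted: order by with limit
select * from emp order by sal limit 10;
```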

9. JVM reuse

Useful when there are many small files.
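
The original notes do not give a setting here. As an assumption, the Hadoop property usually associated with JVM reuse is mapreduce.job.jvm.numtasks, and the value 10 below is only illustrative. On YARN-based MapReduce this setting may not be honored, so treat it as a sketch:

```sql
-- assumption: let one JVM run up to 10 tasks before it exits
set mapreduce.job.jvm.numtasks=10;
```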

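10. Speculative execution

Some tasks run noticeably slower than the others. Hadoop uses speculative execution: by certain rules it identifies the straggler task and launches a backup task for it,
so that the backup and the original process the same data at the same time. The result of whichever finishes first is used as the final result.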
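
The notes do not list the switches. The settings below are the ones commonly used for this in Hadoop and Hive, shown here as a sketch and worth checking against the versions in use:

```sql
-- speculative execution for map tasks and reduce tasks (Hadoop)
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;

-- Hive-side switch for speculative execution of reducers
set hive.mapred.reduce.tasks.speculative.execution=true;
```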

11. Execution plan: EXPLAIN
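
A minimal usage sketch, assuming a hypothetical table emp with columns deptno and sal. EXPLAIN prints the stages and operators without running the query, and EXPLAIN EXTENDED adds more detail:

```sql
-- show the stage plan
explain
select deptno, avg(sal) from emp group by deptno;

-- show the plan with additional detail
explain extended
select deptno, avg(sal) from emp group by deptno;
```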

posted on 2020-12-19 17:50 happygril3
