distinct用group by优化
当数据量非常大,在同一个query中计算多个不相关列的distinct时,往往很容易出现数据倾斜现象,导致运行半天都不能得到结果。
比如以下的SQL语句(a, b, c没有相关性):
select distinct(a), distinct(b), distinct(c) from tableName;
改进方案:把计算的distinct先group by,然后再计算:
select distinct(t.a), distinct(t.b), distinct(t.c) from( select a, b, c from tableName group by a, b, c; )t;
原理:distinct是在reduce里面计算的,再进行merge,而group by的字段会在map这里分N路hash。group by越多,hash分的越随机。
group by中avg以及sum的处理方式:
select count(distinct a) as a, count(distinct b) as b, count(distinct c) as c, count(distinct d) as d, avg(e) as e, sum(f) as f from tablename;
处理方式如下:
select count(distinct a) as a, count(distinct b) as b, count(distinct c) as c, count(distinct d) as d, sum(e)/sum(e_count) as e, // avg的处理 sum(f) as f // sum的处理 from ( select v.a, v.b, v.c, v.d, sum(v.e) as e, sum(v.f) as f, sum(v.e_count) as e_count from( select a, b, c, d, e, f, case when e=null then 0 else 1 end as e_count from tablename )v group by v.a, v.b, v.c, v.d )t;