How COUNT(DISTINCT) works in Hive
References:
https://blog.csdn.net/oracle8090/article/details/80760233
https://www.cnblogs.com/ling1995/p/7339424.html (recommended)
-- Here the 5 in set mapreduce.job.reduces=5; DOES take effect (deduplication happens per group, in parallel):
SELECT dept_num,
       COUNT(DISTINCT name) AS name_cnt
FROM emp_ct
GROUP BY dept_num;
-- Here the 5 in set mapreduce.job.reduces=5; does NOT take effect: there is only one reducer task (global deduplication):
select count(distinct dept_num)
from emp_ct;
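To see why the GROUP BY form can use all 5 reducers, you can EXPLAIN it and check the partition columns of the Reduce Output Operator (a quick check, assuming the same emp_ct table as above):

```sql
-- The map output is partitioned by dept_num, so rows for different
-- departments can go to different reducers; each reducer then
-- deduplicates names only within its own departments.
EXPLAIN
SELECT dept_num,
       COUNT(DISTINCT name) AS name_cnt
FROM emp_ct
GROUP BY dept_num;
```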
How count(distinct id) works

Looking at the execution plan of count(distinct id): there is only one reducer task (even if you set the number of reducers to 100, it makes no difference). All ids are shuffled to that single reducer, which deduplicates them and then aggregates. This very easily causes data skew.
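The plan shown below can be reproduced with EXPLAIN (emp_ct is the table used throughout these notes):

```sql
-- Show the execution plan for the single-reducer global distinct count.
EXPLAIN
SELECT COUNT(DISTINCT dept_num)
FROM emp_ct;
```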
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: emp_ct
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: dept_num (type: int)
outputColumnNames: _col0
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col0:0._col0)
mode: complete
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Example run. Note that the reducer count we set does not take effect:
hive> set mapreduce.job.reduces=5;
hive>
> select count(distinct dept_num)
> from emp_ct;
Query ID = mart_fro_20200320233947_4f60c190-4967-4da6-bf3e-97db786fbc6c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Start submit job !
Start GetSplits
GetSplits finish, it costs : 32 milliseconds
Submit job success : job_1584341089622_358496
Starting Job = job_1584341089622_358496, Tracking URL = http://BJHTYD-Hope-25-11.hadoop.jd.local:50320/proxy/application_1584341089622_358496/
Kill Command = /data0/hadoop/hadoop_2.100.31_2019090518/bin/hadoop job -kill job_1584341089622_358496
Hadoop job(job_1584341089622_358496) information for Stage-1: number of mappers: 2; number of reducers: 1
2020-03-20 23:39:58,215 Stage-1(job_1584341089622_358496) map = 0%, reduce = 0%
2020-03-20 23:40:09,628 Stage-1(job_1584341089622_358496) map = 50%, reduce = 0%, Cumulative CPU 2.74 sec
2020-03-20 23:40:16,849 Stage-1(job_1584341089622_358496) map = 100%, reduce = 0%, Cumulative CPU 7.43 sec
2020-03-20 23:40:29,220 Stage-1(job_1584341089622_358496) map = 100%, reduce = 100%, Cumulative CPU 10.64 sec
MapReduce Total cumulative CPU time: 10 seconds 640 msec
Stage-1 Elapsed : 40533 ms job_1584341089622_358496
Ended Job = job_1584341089622_358496
MapReduce Jobs Launched:
Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 10.64 sec HDFS Read: 0.000 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 40s533ms job_1584341089622_358496
Total MapReduce CPU Time Spent: 10s640ms
Total Map: 2 Total Reduce: 1
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
3
Time taken: 43.025 seconds, Fetched: 1 row(s)
A fix for count(distinct id)
See also:
https://zhuanlan.zhihu.com/p/400830935
https://baijiahao.baidu.com/s?id=1709215794913205586&wfr=spider&for=pc
So how do we solve this problem? The trick is quite elegant: using Hive's support for nested queries, we turn the single MapReduce job into two jobs. The first stage selects all the distinct ids; the second stage counts those already-deduplicated ids. In the first stage we can raise the reducer count and process the map output in parallel. In the second stage, since the ids are already deduplicated, the map phase of the COUNT(*) does not need to emit the original id data, only a merged partial count. So even though Hive still forces a single reduce task for the second stage, the tiny amount of map output will not make that lone reducer a bottleneck. The improved SQL is:

select count(dept_num)
from (
  select distinct dept_num
  from emp_ct
) t1;
Let's look at the execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: emp_ct
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: dept_num (type: int)
outputColumnNames: dept_num
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: dept_num (type: int)
sort order: +
Map-reduce partition columns: dept_num (type: int)
Statistics: Num rows: 42 Data size: 171 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int)
mode: complete
outputColumnNames: _col0
Statistics: Num rows: 21 Data size: 85 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
sort order:
Statistics: Num rows: 21 Data size: 85 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: int)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: complete
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Now the actual run. Note the reducer counts: the first job uses 5 reducers, the second uses 1.
hive> set mapreduce.job.reduces=5;
hive>
> select count(dept_num)
> from (
> select distinct dept_num
> from emp_ct
> ) t1;
Query ID = mart_fro_20200320234453_68ad3780-c3e5-44bc-94df-58a8f2b01f59
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Start submit job !
Start GetSplits
GetSplits finish, it costs : 13 milliseconds
Submit job success : job_1584341089622_358684
Starting Job = job_1584341089622_358684, Tracking URL = http://BJHTYD-Hope-25-11.hadoop.jd.local:50320/proxy/application_1584341089622_358684/
Kill Command = /data0/hadoop/hadoop_2.100.31_2019090518/bin/hadoop job -kill job_1584341089622_358684
Hadoop job(job_1584341089622_358684) information for Stage-1: number of mappers: 2; number of reducers: 5
2020-03-20 23:45:02,920 Stage-1(job_1584341089622_358684) map = 0%, reduce = 0%
2020-03-20 23:45:23,533 Stage-1(job_1584341089622_358684) map = 50%, reduce = 0%, Cumulative CPU 3.48 sec
2020-03-20 23:45:25,596 Stage-1(job_1584341089622_358684) map = 100%, reduce = 0%, Cumulative CPU 7.08 sec
2020-03-20 23:45:32,804 Stage-1(job_1584341089622_358684) map = 100%, reduce = 20%, Cumulative CPU 9.43 sec
2020-03-20 23:45:34,861 Stage-1(job_1584341089622_358684) map = 100%, reduce = 40%, Cumulative CPU 12.39 sec
2020-03-20 23:45:36,923 Stage-1(job_1584341089622_358684) map = 100%, reduce = 80%, Cumulative CPU 18.47 sec
2020-03-20 23:45:40,011 Stage-1(job_1584341089622_358684) map = 100%, reduce = 100%, Cumulative CPU 23.23 sec
MapReduce Total cumulative CPU time: 23 seconds 230 msec
Stage-1 Elapsed : 46404 ms job_1584341089622_358684
Ended Job = job_1584341089622_358684
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Start submit job !
Start GetSplits
GetSplits finish, it costs : 47 milliseconds
Submit job success : job_1584341089622_358729
Starting Job = job_1584341089622_358729, Tracking URL = http://BJHTYD-Hope-25-11.hadoop.jd.local:50320/proxy/application_1584341089622_358729/
Kill Command = /data0/hadoop/hadoop_2.100.31_2019090518/bin/hadoop job -kill job_1584341089622_358729
Hadoop job(job_1584341089622_358729) information for Stage-2: number of mappers: 5; number of reducers: 1
2020-03-20 23:45:48,353 Stage-2(job_1584341089622_358729) map = 0%, reduce = 0%
2020-03-20 23:46:05,846 Stage-2(job_1584341089622_358729) map = 20%, reduce = 0%, Cumulative CPU 2.62 sec
2020-03-20 23:46:06,873 Stage-2(job_1584341089622_358729) map = 60%, reduce = 0%, Cumulative CPU 8.49 sec
2020-03-20 23:46:08,931 Stage-2(job_1584341089622_358729) map = 80%, reduce = 0%, Cumulative CPU 11.53 sec
2020-03-20 23:46:09,960 Stage-2(job_1584341089622_358729) map = 100%, reduce = 0%, Cumulative CPU 15.23 sec
2020-03-20 23:46:35,639 Stage-2(job_1584341089622_358729) map = 100%, reduce = 100%, Cumulative CPU 20.37 sec
MapReduce Total cumulative CPU time: 20 seconds 370 msec
Stage-2 Elapsed : 54552 ms job_1584341089622_358729
Ended Job = job_1584341089622_358729
MapReduce Jobs Launched:
Stage-1: Map: 2 Reduce: 5 Cumulative CPU: 23.23 sec HDFS Read: 0.000 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 46s404ms job_1584341089622_358684
Stage-2: Map: 5 Reduce: 1 Cumulative CPU: 20.37 sec HDFS Read: 0.000 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 54s552ms job_1584341089622_358729
Total MapReduce CPU Time Spent: 43s600ms
Total Map: 7 Total Reduce: 6
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
3
Time taken: 103.692 seconds, Fetched: 1 row(s)
This rewrite is similar in spirit to what the hive.groupby.skewindata parameter does!
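As a hedged sketch (not verified on this cluster), you can ask Hive itself to perform a comparable two-job rewrite by enabling that flag before running the original query:

```sql
-- Let Hive rewrite skewed GROUP BY / DISTINCT aggregations into two jobs:
-- job 1 spreads the data across reducers for partial aggregation,
-- job 2 merges the partial results.
set hive.groupby.skewindata=true;

SELECT COUNT(DISTINCT dept_num)
FROM emp_ct;
```

Note that this flag does not support queries with DISTINCT over multiple different columns.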
Hands-on test: compare the two forms (the subquery needs an alias in Hive):

select count(distinct dept_num) from emp_ct;

select count(*) from (
  select distinct dept_num from emp_ct
) t1;
