Hive sorting

order by, sort by, distribute by, and cluster by

1. order by: global sort. All rows pass through a single reducer, so the whole result is ordered.
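
A minimal sketch of a global sort on the student table used in this post. The limit in the second query is added here for illustration only. Depending on the Hive version and its strict-mode settings, an order by without a limit may be rejected.

```sql
-- Global sort: a single reducer produces the fully ordered result.
select * from student order by id;

-- Descending order, keeping only the first 3 rows (the limit is an
-- illustrative addition; it keeps the single reducer's output small).
select * from student order by id desc limit 3;
```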

2. sort by: local sort. Rows are ordered within each reducer, but the overall result is not guaranteed to be globally ordered.

hive (default)> select * from  student;
student.id	student.name
1001	zs
1004	ww
1005	zl
1001	zhangshan
1002	lisi
1003	wangwu

After sort by: rows are ordered within each reducer (the reducer count is set to 2 first)
hive (default)> set mapreduce.job.reduces=2;
hive (default)> select * from  student sort by  id;
student.id	student.name
//reduce 1
1001	zhangshan
1001	zs
1002	lisi
1005	zl
//reduce 2
1003	wangwu
1004	ww
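
The console prints the reducers' output back to back, so the boundary between them is hard to see. One way to inspect each reducer's output separately is to write the result to a directory, where each reducer typically writes its own file (for example 000000_0 and 000001_0). This is a sketch: the path /tmp/student_sort_by is a hypothetical location chosen for this example.

```sql
-- Use 2 reducers, as in the session above.
set mapreduce.job.reduces=2;

-- Write the sort by result to a local directory (hypothetical path);
-- each output file holds one reducer's rows, sorted by id.
insert overwrite local directory '/tmp/student_sort_by'
row format delimited fields terminated by '\t'
select * from student sort by id;
```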

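3. distribute by: plays the role of the partitioner in MapReduce. Rows are assigned to reducers according to the given column, so rows with the same value always go to the same reducer.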

hive (default)> select * from  student;
student.id	student.name
1001	zs
1004	ww
1005	zl
1002	lisi
1001	zhangshan
1003	wangwu

hive (default)> select * from  student distribute by  id;
Assuming two reducers, the rows might be split like this:
reduce 1:
1001	zs
1002	lisi
1001	zhangshan

reduce 2:
1004	ww
1005	zl
1003	wangwu

As the output shows, distribute by only partitions rows by the given column; it does not sort them.
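
The partitioning idea can be sketched with Hive's built-in hash and pmod functions. This is only an approximation that assumes 2 reducers: the actual reducer is chosen by Hive's partitioner at run time, so the approx_reducer column (a name made up for this example) may not match the real placement. What does hold is that rows with the same id always get the same value.

```sql
-- Rough illustration of hash partitioning, not Hive's exact partitioner:
-- rows with equal id produce the same approx_reducer value.
select id, name, pmod(hash(id), 2) as approx_reducer
from student;
```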

4. cluster by: shorthand for distribute by and sort by on the same column

The following two statements are equivalent:
select * from student distribute by id sort by id;
select * from student cluster by id;
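
cluster by sorts in ascending order only, and only on the column it distributes by. When the partition column and the sort column differ, or a descending order is needed, distribute by and sort by have to be written out separately. A small sketch, assuming 2 reducers:

```sql
set mapreduce.job.reduces=2;

-- Rows with the same id land on the same reducer; within each
-- reducer they are ordered by name, descending.
select * from student distribute by id sort by name desc;
```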
posted @ 2020-04-26 10:53  枫林晔雪