Hive sorting
order by, sort by, distribute by, and cluster by
1. order by: global sort. All rows pass through a single reducer, so the entire result is ordered.
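A minimal sketch of a global sort, assuming the same student table (columns id and name) that is used in the examples later in this post. The limit clause is an optional addition here, not something the original note requires:

```sql
-- Global sort: every row goes through one reducer,
-- so the full result set comes back ordered by id.
select * from student order by id;

-- Descending order works the same way; adding a limit keeps
-- the amount of data the single reducer has to emit small.
select * from student order by id desc limit 3;
```

Because a single reducer does all the work, order by can become a bottleneck on large tables, which is the motivation for the per-reducer alternatives below.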
2. sort by: local sort. The rows inside each reducer are sorted, but the overall result is not necessarily sorted.
hive (default)> select * from student;
student.id student.name
1001 zs
1004 ww
1005 zl
1001 zhangshan
1002 lisi
1003 wangwu
After sort by, the rows are sorted within each reducer:
hive (default)> set mapreduce.job.reduces=2; -- use 2 reducers
hive (default)> select * from student sort by id;
student.id student.name
//reduce 1
1001 zhangshan
1001 zs
1002 lisi
1005 zl
//reduce 2
1003 wangwu
1004 ww
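One way to inspect the per-reducer output is to write the query result to a directory instead of the console. This is a sketch under assumptions: the path /tmp/student_sort_by is a hypothetical scratch directory on the local filesystem, and anything already in it is overwritten.

```sql
-- Assumption: /tmp/student_sort_by is a hypothetical scratch directory.
set mapreduce.job.reduces=2;

-- Each reducer writes its own output file into the directory.
insert overwrite local directory '/tmp/student_sort_by'
select * from student sort by id;
```

With two reducers, the directory should typically end up with one file per reducer, and each file should be sorted by id on its own.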
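3. distribute by: analogous to the partitioner in MapReduce. Rows are assigned to reducers according to the given column, so rows with the same value in that column go to the same reducer.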
hive (default)> select * from student;
student.id student.name
1001 zs
1004 ww
1005 zl
1002 lisi
1001 zhangshan
1003 wangwu
hive (default)> select * from student distribute by id;
With two reducers, the contents of each reducer might be:
reduce 1:
1001 zs
1002 lisi
1001 zhangshan
reduce 2:
1004 ww
1005 zl
1003 wangwu
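As this shows, distribute by only partitions the rows by the given column; it does not sort them.
4. cluster by: equivalent to distribute by plus sort by on the same column.
The following two statements are equivalent: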
select * from student distribute by id sort by id;
select * from student cluster by id;
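cluster by always sorts in ascending order and does not accept asc or desc. When a different sort direction or a different sort column is needed, the explicit distribute by ... sort by form can be used instead. A minimal sketch on the same student table:

```sql
-- Partition rows by id, but sort each reducer's rows in descending order.
select * from student distribute by id sort by id desc;

-- Partition rows by id, then sort each reducer's rows by a different column.
select * from student distribute by id sort by name;
```

Neither of these can be written with cluster by alone, since cluster by uses one column list for both partitioning and sorting.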
