hive 面试题之四
课程选修
现有一份以下格式的数据:
表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:
数据如下
1,a 1,b 1,c 1,e 2,a 2,c 2,d 2,f 3,a 3,b 3,c 3,e
要求:
编写Hive的HQL语句来实现以下结果: 表中的1表示选修,表中的0表示未选修 id a b c d e f 1 1 1 1 0 1 0 2 1 0 1 1 0 1 3 1 1 1 0 1 0
解题思路:此题目考察的是hive中将列转行的函数此题用到 collect_list,collect_set
1,建表
create table mianshi4(id int,course string) row format delimited fields terminated by ","; 2,数据加载 load data local inpath "/test_datas/mianshi4.txt" into table mianshi4;
3,通过id进行分组
select id,collect_set(course) course from mianshi4 group by id;
得到如下结果
1 ["a","b","c","e"] 2 ["a","c","d","f"] 3 ["a","b","c","e"]
由要求展示的结果可知我们要这样的数据才能进行
1 ["a","b","c","e"] ["a","b","c","e","d","f"] 2 ["a","c","d","f"] ["a","b","c","e","d","f"] 3 ["a","b","c","e"] ["a","b","c","e","d","f"]
于是:有了这样的SQL语句以及这样的结果
select collect_set(course) from mianshi4;
["a","b","c","e","d","f"]
通过分析我们已得到这样的思路 ;要得到以上的结果我们要用笛卡尔积来做于是(第一次运行的时候回提示要进行设置,hive中默认是不能使用笛卡尔积的,要手动进行设置,根据提示进行设置即可)
create table mianshi4_step1 as select a.id id ,a.course acourse ,b.course bcourse from ( select id,collect_set(course) course from mianshi4 group by id) a ,(select collect_set(course) course from mianshi4) b; 1 ["a","b","c","e"] ["a","b","c","e","d","f"] 2 ["a","c","d","f"] ["a","b","c","e","d","f"] 3 ["a","b","c","e"] ["a","b","c","e","d","f"]
最有有了以上结果(我们把结果存入到一个中间表中)
select id, if(array_contains(acourse,bcourse[0]),1,0) a, if(array_contains(acourse,bcourse[1]),1,0) b, if(array_contains(acourse,bcourse[2]),1,0) c, if(array_contains(acourse,bcourse[3]),1,0) e, if(array_contains(acourse,bcourse[4]),1,0) d, if(array_contains(acourse,bcourse[5]),1,0) f from mianshi4_step1; id a b c d e f 1 1 1 1 1 0 0 2 1 0 1 0 1 1 3 1 1 1 1 0 0

浙公网安备 33010602011771号