hive 面试题之四

课程选修

现有一份以下格式的数据:
表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:

数据如下

1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

要求:

编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修

id	a	b	c	d	e	f
1	1	1	1	0	1	0
2	1	0	1	1	0	1
3	1	1	1	0	1	0
解题思路:此题目考察的是hive中将列转行的函数此题用到 collect_list,collect_set

  1,建表

create table mianshi4(id int,course string) 
row format delimited fields terminated by ",";
2,数据加载
load data local inpath "/test_datas/mianshi4.txt" into table mianshi4;

  3,通过id进行分组

select id,collect_set(course) course  from mianshi4 group by id;
得到如下结果
1 ["a","b","c","e"] 2 ["a","c","d","f"] 3 ["a","b","c","e"]

由要求展示的结果可知我们要这样的数据才能进行

1       ["a","b","c","e"]      ["a","b","c","e","d","f"]
2       ["a","c","d","f"]      ["a","b","c","e","d","f"]
3       ["a","b","c","e"]      ["a","b","c","e","d","f"]

于是:有了这样的SQL语句以及这样的结果

select collect_set(course) from mianshi4;
["a","b","c","e","d","f"]

通过分析我们已得到这样的思路  ;要得到以上的结果我们要用笛卡尔积来做于是(第一次运行的时候回提示要进行设置,hive中默认是不能使用笛卡尔积的,要手动进行设置,根据提示进行设置即可)

create table mianshi4_step1 as 
select
a.id id ,a.course acourse ,b.course bcourse
from (
select id,collect_set(course) course from mianshi4 group by id) a ,(select collect_set(course) course from mianshi4) b;

1       ["a","b","c","e"]      ["a","b","c","e","d","f"]
2       ["a","c","d","f"]      ["a","b","c","e","d","f"]
3       ["a","b","c","e"]      ["a","b","c","e","d","f"]

最有有了以上结果(我们把结果存入到一个中间表中)

select id,
if(array_contains(acourse,bcourse[0]),1,0) a,
if(array_contains(acourse,bcourse[1]),1,0) b,
if(array_contains(acourse,bcourse[2]),1,0) c,
if(array_contains(acourse,bcourse[3]),1,0) e,
if(array_contains(acourse,bcourse[4]),1,0) d,
if(array_contains(acourse,bcourse[5]),1,0) f 
from mianshi4_step1;

id      a       b       c       d       e       f
1       1       1       1       1       0       0
2       1       0       1       0       1       1
3       1       1       1       1       0       0

 

posted @ 2019-09-04 20:09  随风无义  阅读(187)  评论(0)    收藏  举报