hsql-蚂蚁森林
一.原始数据

u_001,2020/1/1,10 u_001,2020/1/2,150 u_001,2020/1/2,110 u_001,2020/1/2,10 u_001,2020/1/4,50 u_001,2020/1/4,10 u_001,2020/1/6,45 u_001,2020/1/6,90 u_002,2020/1/1,10 u_002,2020/1/2,150 u_002,2020/1/2,70 u_002,2020/1/2,70 u_002,2020/1/3,30 u_002,2020/1/3,80 u_002,2020/1/4,150 u_002,2020/1/5,101 u_002,2020/1/6,68

p001,梭梭树,17 p002,沙柳,19 p003,樟子树,146 p004,胡杨,215
二.数据导入hive
1. 创建表user_low_carbon
create table user_low_carbon(user_id String,data_dt String,low_carbon int) row format delimited fields terminated by ",";
2. 导入数据
local data local inpath "/opt/module/hive/data/user_low_carbon.txt" into table user_low_carbon;
1.创建plant_carbon表
create table plant_carbon(plant_id String,plant_name string,plant_carbon int) row format delimited fields terminated by ",";
2. 导数据 load data local inpath "/opt/module/hive/data/plant_carbon.txt" into table plant_carbon;
三.问题
1.
假设2020年1月1日开始记录低碳数据(user_low_carbon),假设2020年6月1日之前满足申领条件的用户都申领了一棵 p004-胡杨【用掉了一棵胡杨的低碳量】,剩余的能量全部用来领取 p002-沙柳。
【用户总低碳量减去一棵胡杨所需的低碳量,剩余用来领取沙柳,再除以沙柳的低碳量】
需求:统计在6月1日累计申领 p002-沙柳排名前10的用户信息;以及当前用户比排名靠后的用户多领沙柳的棵数。
得到的统计结果如下表样式:
user_id plant_count less_count(比后一名多领了几棵沙柳)
思路:
1. 年份过滤(where),先求每个用户(分组 group by)的总低碳量(sum聚合),并降序排名(order by desc)前11名(limit 11) 得到表 t1
2.求胡杨需要多少低碳量 -- t2
3.求沙柳需要多少低碳量 -- t3
4.联立 : from t1,t2,t3 ,选出需要的字段
user_id ---用户量
total ---总量
num2 --- 胡杨量
num3 --- 沙柳量
floor((total-num2)/num3) plant_count
5. 开窗 上移lead() ,lag()下移
lead(plant_count,1,0) over(order by plant_count desc) countn
6.相减取排序取前10
代码如下:
select user_id, plant_count, (plant_count-countn) less_count from (select user_id, plant_count, lead(plant_count, 1, 0) over (order by plant_count desc) countn from (select user_id, floor((total - num2) / num3) plant_count from ( select t1.user_id, total, t2.plant_carbon num2, t3.plant_carbon num3 from ( select user_id, sum(low_carbon) as total from user_low_carbon -- where (int(year(regexp_replace(data_dt, "/", "-"))) <= 2020) -- and (int(year(regexp_replace(data_dt, "/", "-"))) > 2019) -- and int(month(regexp_replace(data_dt, "/", "-"))) < 6 where datediff(regexp_replace(data_dt, "/", "-"),regexp_replace('2020/6/1', "/", "-"))<0 group by user_id order by total desc -- 降序 limit 11) t1, (select plant_carbon from plant_carbon where plant_id = "p004") t2, (select plant_carbon from plant_carbon where plant_id = "p002") t3 ) t4 ) t5 limit 10)t6 order by plant_count desc ;
有疑问可以加wx:18179641802,进行探讨