hive函数

函数

查看系统自带的函数

show functions;

显示自带的函数的用法

desc function upper;

详细显示自带的函数的用法

desc function extended upper;

UDF ------> 一进一出 (upper)

UDAF-----> 多进一出（count）

UDTF------> 一进多出（explode）

多指的是输入数据的行数

空字段赋值

函数说明

NVL：给值为 NULL 的数据赋值，它的格式是 NVL( value，default_value)。

它的功能是如果 value 为 NULL，则 NVL 函数返回 default_value 的值，

否则返回 value 的值，如果两个参数都为 NULL ，则返回 NULL。

CASE WHEN THEN ELSE END

原始数据

悟空    A    男
大海    A    男
宋宋    B    男
凤姐    A    女
婷姐    B    女
婷婷    B    女

select 
 dept_id,
 sum(case sex when '男' then 1 else 0 end) maleCount,
 sum(case sex when '女' then 1 else 0 end) faleCount
 from emp_sex
 group by dept_id;

select 
 dept_id,
 sum(if(sex='男',1,0)) maleCount,
 sum(if (sex='女',1,0)) faleCount
from emp_sex
group by dept_id;

CONCAT ,CONCAT_WS,COLLECTION_SET,CONLLECTIONJ_LIST

原始数据：

孙悟空$白羊座$A
大海$射手座$A
宋宋$白羊座$B
猪八戒$白羊座$A
凤姐$射手座$A
苍老师$白羊座$B

##创建表
create table person_info(name string,constellation string,blood_type string) 
 row format delimited fields terminated by '$';
###将数据加载到表中
load data local inpath '/opt/data/person_info.txt' into table person_info;
#查询
select * from person_info;

select 
 concat(constellation,',',blood_type) cb,
 name
 from person_info;

##或
 from person_info;
 select 
 concat_ws(',',constellation,blood_type) cb,
 name
 from person_info;

 select 
  cb,
  collect_set(name)
 from 
  (select 
 concat(constellation,',',blood_type) cb,
 name
 from person_info)t1
 group by cb;

 select 
  cb,
  concat_ws('|',collect_set(name))
 from 
  (select 
 concat(constellation,',',blood_type) cb,
 name
 from person_info)t1
 group by cb;

1.原始数据：

创建表

create table test(
 name string,
 friends array<string>,
 children map<string,int>,
 address struct<street:string,city:string>)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

导入数据：

load data local inpath '/opt/data/test.txt' into table test;

列转行

explode 炸裂函数

split 切割

函数说明

EXPLODE(col)：将 hive 一列中复杂的 Array 或者 Map 结构拆分成多行。

LATERAL VIEW

用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias

解释：用于和 split, explode 等 UDTF 一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

1. 原始数据：

《疑犯追踪》悬疑,动作,科幻,剧情
《Lie to me》悬疑,警匪,动作,心理,剧情
《战狼 2》战争,动作,灾难

需求：将电影分类中的数组数据展开。结果如下：

《疑犯追踪》悬疑
《疑犯追踪》动作
《疑犯追踪》科幻
《疑犯追踪》剧情
《Lie to me》悬疑
《Lie to me》警匪
《Lie to me》动作
《Lie to me》心理
《Lie to me》剧情
《战狼 2》战争
《战狼 2》动作

过程：

1. 将原数据写入到：

/opt/data/moive_info.txt

2. 创建表movie_info

# 创建表
create table moive_info(name string,category string) 
row format delimited fields terminated by '\t';
# 加载数据到表中
load data local inpath '/opt/data/moive_info.txt' into table moive_info;
# 查看表
select * from moive_info;

3. 切割split

select name,split(category,',') cate_arr from moive_info;

hive (default)> select name,split(category,',') cate_arr from moive_info;
OK
name    cate_arr
《疑犯追踪》    ["悬疑","动作","科幻","剧情"]
《Lie to me》    ["悬疑","警匪","动作","心理","剧情"]
《战狼 2》    ["战争","动作","灾难"]
Time taken: 0.605 seconds, Fetched: 3 row(s)

4. 炸裂：explode

select explode(split(category,',')) from moive_info;

hive (default)> select explode(split(category,',')) from moive_info;
OK
col
悬疑
动作
科幻
剧情
悬疑
警匪
动作
心理
剧情
战争
动作
灾难
Time taken: 0.326 seconds, Fetched: 12 row(s)

5. LATERAL VIEW

select
   name,
   category_name
from 
   moive_info
lateral VIEW explode(split(category,",")) moive_info_tmp as category_name;

hive (default)> select 
              >   name,
              >   category_name
              > from moive_info
              > lateral VIEW explode(split(category,',')) moive_info_tmp as category_name;
OK
name    category_name
《疑犯追踪》    悬疑
《疑犯追踪》    动作
《疑犯追踪》    科幻
《疑犯追踪》    剧情
《Lie to me》    悬疑
《Lie to me》    警匪
《Lie to me》    动作
《Lie to me》    心理
《Lie to me》    剧情
《战狼 2》    战争
《战狼 2》    动作
《战狼 2》    灾难
Time taken: 0.091 seconds, Fetched: 12 row(s)

窗口函数（开窗函数）

相关函数说明

OVER()：指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变而变化。

CURRENT ROW：当前行

n PRECEDING：往前 n 行数据 n FOLLOWING：往后 n 行数据 UNBOUNDED：起点，

UNBOUNDED PRECEDING 表示从前面的起点，

UNBOUNDED FOLLOWING 表示到后面的终点 LAG(col,n,default_val)：往前第 n 行数据 LEAD(col,n, default_val)：往后第 n 行数据

NTILE(n)：把有序窗口的行分发到指定数据的组中，各个组有编号，编号从 1 开始，对于每一行，NTILE 返回此行所属的组的编号。注意：n 必须为 int 类型。

1.原始数据：

hive (default)> create table business(name string,orderdate string,cost int)
> row format delimited fields terminated by ',';

hive (default)> truncate table business;
OK
Time taken: 0.803 seconds
hive (default)> load data local inpath '/opt/data/business.txt' into table business;
Loading data to table default.business
OK
Time taken: 0.424 seconds
hive (default)> select * from business;
OK
business.name    business.orderdate    business.cost
jack    2017-01-01    10
tony    2017-01-02    15
jack    2017-02-03    23
tony    2017-01-04    29
jack    2017-01-05    46
jack    2017-04-06    42
tony    2017-01-07    50
jack    2017-01-08    55
mart    2017-04-08    62
mart    2017-04-09    68
neil    2017-05-10    12
mart    2017-04-11    75
neil    2017-06-12    80
mart    2017-04-13    94
Time taken: 0.223 seconds, Fetched: 14 row(s)

需求

（1）查询在 2017 年 4 月份购买过的顾客及总人数

（2）查询顾客的购买明细及月购买总额

（3）上述的场景, 将每个顾客的 cost 按照日期进行累加

（4）查询每个顾客上次的购买时间

（5）查询前 20%时间的订单信息

select 
  name,
  count(*) num
from business
where substring(orderdate,1,7)='2017-04'
group by name;

查询员工4月份购买商品次数

OK
name num
jack 1
mart 4
Time taken: 47.937 seconds, Fetched: 2 row(s)

（1）查询在 2017 年 4 月份购买过的顾客及总人数

select 
  name,
  count(*) over()
from business
where substring(orderdate,1,7)='2017-04'
group by name;

结果：

（2）查询顾客的购买明细及月购买总额

select 
  name,
  orderdate,
  cost,
  sum(cost) over(partition by month(orderdate))
from business;

先把数据排好序进行分析

select * from business order by name,orderdate;
business.name    business.orderdate    business.cost
jack    2017-01-01    10
jack    2017-01-05    46
jack    2017-01-08    55
jack    2017-02-03    23
jack    2017-04-06    42
mart    2017-04-08    62
mart    2017-04-09    68
mart    2017-04-11    75
mart    2017-04-13    94
neil    2017-05-10    12
neil    2017-06-12    80
tony    2017-01-02    15
tony    2017-01-04    29
tony    2017-01-07    50

方式1：最终语句：

 select 
   name,
   orderdate,
   cost,
   sum(cost) over(partition by name order by orderdate)
 from business;

结果：

方式二起点到当前行

 select 
  name,
  orderdate,
  cost,
  sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row)
 from business;

需求：当前行，前一行，后一行的值

#获取求当前行，前一行和后一行的和
 select 
  name,
  orderdate,
  cost,
  sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following)
from business;

（4）查询每个顾客上次的购买时间

sql1 :

select
 name,
 orderdate,
 lag(orderdate,1) over(partition by name order by orderdate)
from business;

结果：

sql2:

# 没有的话就设置默认值 '2017-01-01'
select
 name,
 orderdate,
 lag(orderdate,1,'2017-01-01') over(partition by name order by orderdate)
from business;

sql3:

# 没有的话就设置当前值为默认值 
select
 name,
 orderdate,
 lag(orderdate,1,orderdate) over(partition by name order by orderdate)
from business;

### lead 往上移动

select
name,
orderdate,
lead(orderdate,1,orderdate) over(partition by name order by orderdate)
from business;

分步写sql 的过程

sql

select 
 name,
 orderdate,
 cost,
 ntile(5) over(order by orderdate) groupId
from business;

结果：

查询前 20%时间的订单信息：

select 
 name,
 orderdate,
 cost
from 
(select 
 name,
 orderdate,
 cost,
 ntile(5) over(order by orderdate) groupId
from business)t1
where groupId=1;

Rank

RANK() 排序相同时会重复，总数不会变

DENSE_RANK() 排序相同时会重复，总数会减少

ROW_NUMBER() 会根据顺序计算

##/opt/data/sorce.txt
create table score(name string,subject string,score string)
row format delimited fields terminated by '$';
load data local inpath '/opt/data/sorce.txt' into table score;

select *,rank() over(order by score) from score;

select *,dense_rank() over(order by score) from score;

行号：

select *,row_number() over(order by score) from score;

select *,rank() over(partition by subject order by score desc ) from score;

获取每个学科的前3名（分组topn）

select 
 name,
 subject,
 score
from 
(select *,rank() over(partition by subject order by score desc ) rank_num from score)t1 
where rank_num<=3;

posted @ 2021-03-21 20:34 冰底熊阅读(185) 评论(0) 收藏举报

侠客云