Functions
1. Hive function categories
Standard functions (user-defined ones are called UDFs): take one row, return one result, e.g. length
Aggregate functions (user-defined ones are called UDAFs): take multiple rows, return one result, used with GROUP BY, e.g. sum, count
Table-generating functions (user-defined ones are called UDTFs): take one row, return multiple rows, e.g. explode
2. Function help
SHOW FUNCTIONS;
Lists all functions available in the current environment.
DESC FUNCTION length;
Shows the help text for a function.
DESC FUNCTION EXTENDED length;
Shows extended information about the function.
3. Math functions
round: rounds to the nearest value
rand: returns a random number
conv: base (radix) conversion
conv(num, from_base, to_base): num is the value, from_base its current base, to_base the target base
cast: type conversion, e.g. select cast(id as STRING) from student; converts an INT to a STRING
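A quick sketch of these math functions (standard Hive built-ins; the literal values are made up for illustration):

```sql
-- round: round to 2 decimal places
SELECT round(3.567, 2);      -- 3.57
-- rand: random double in [0, 1)
SELECT rand();
-- conv: convert decimal 15 to binary (returns a string)
SELECT conv('15', 10, 2);    -- '1111'
```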
4. Date functions
from_unixtime: converts a Unix timestamp to a formatted string
select from_unixtime(1567995900, 'yyyyMMdd HHmmss');
unix_timestamp: converts a date string to a Unix timestamp
select unix_timestamp('2011-12-10 13:10:12');
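The two functions are inverses, so a date string can be re-formatted by round-tripping through a timestamp (a sketch; the result depends on the session time zone):

```sql
-- '2011-12-10 13:10:12' -> Unix timestamp -> 'yyyy-MM-dd' string
SELECT from_unixtime(unix_timestamp('2011-12-10 13:10:12'), 'yyyy-MM-dd');
```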
5. Conditional functions
IF(expr1, expr2, expr3) !!!
If expr1 is true, returns expr2; otherwise returns expr3.
SELECT name, age, if(age > 20, 'A', 'B') as type FROM student;
coalesce: returns the first non-null argument #
SELECT coalesce(null, 12, 3, 14, null);
CASE !!!
Buckets data into groups.
SELECT id, CASE
WHEN id <=2 THEN 'a'
WHEN id > 2 AND id <=10 THEN 'b'
ELSE 'c'
END FROM student;
SELECT mycol, sum(id) FROM
    (SELECT id, (CASE WHEN id <= 2 THEN 'a' WHEN id > 2 AND id <= 10 THEN 'b' ELSE 'c' END) AS mycol FROM student) t
GROUP BY mycol;
Simple CASE form (compares id against each WHEN value):
SELECT CASE id WHEN 2 THEN 'a' ELSE 'b' END from student;
//////////////////////////////////////////////////////////////////////
6. String functions
String concatenation
concat: concatenates values of any type, with no separator
select id, concat(name, age) from student;
concat_ws: concatenates only strings, with a specified separator
select concat_ws('_', 'asdfasd', '18');
instr(string str, string substr): returns the 1-based position of substr in str
select instr('candle', 'dle');
length: string length
lower / lcase: lowercase
upper / ucase: uppercase
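Minimal examples of the basic string functions (the literal inputs are made up):

```sql
SELECT length('candle');   -- 6
SELECT lower('CANDLE');    -- 'candle'
SELECT upper('candle');    -- 'CANDLE'
```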
Regular expressions #
regexp_extract: extracts a substring using a regular expression
regexp_extract(str, regexp[, idx])
Parameters:
str: the string to process
regexp: the regular expression to match
idx: group index
0 returns the entire match
1 returns the first parenthesized group
2 returns the second parenthesized group
select regexp_extract('x=asdfasdf12345asdf', 'x=([a-z]+)([0-9]+)([a-z]+)', 3);
regexp_replace: replaces matches of a regular expression
select regexp_replace('x=asdfasdf12345asdf3456345', '([0-9]+)', 'xxxx');
URL #
parse_url: parses components out of a URL
http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere
select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');
Supported parts: [HOST domain, PATH path, QUERY query string, REF anchor, PROTOCOL protocol, FILE, AUTHORITY host+port, USERINFO]
Example output:
hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'HOST');
OK
_c0
www.candle.com
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'PATH');
OK
_c0
/dir1/dir2/dir3/file.html
Time taken: 0.093 seconds, Fetched: 1 row(s)
hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY');
OK
_c0
key1=value1&key2=value2
Time taken: 0.051 seconds, Fetched: 1 row(s)
hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY', 'key1');
OK
_c0
value1
Time taken: 0.105 seconds, Fetched: 1 row(s)
hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');
OK
_c0
imhere
JSON #
'{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}'
{
"store":
{
"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
} ,
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}
hive (hadoop)> select get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.bicycle.price');
OK
_c0
19.95
hive (hadoop)> select get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.fruit[0].type');
OK
_c0
apple
7. Converting between wide and long tables !!!
explode: table-generating function
hive (hadoop)> select explode(city) from student1;
Error:
hive (hadoop)> select id, name, explode(city) from student1;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
Arrays
In a Hive query, a table-generating function cannot be selected alongside ordinary columns directly.
You must first generate a lateral view and join it back:
select id, name, mycity from student1 lateral view explode(city) tmp_view AS mycity;
SELECT id, name, mycity FROM student1 LATERAL VIEW explode(city) view_alias AS column_alias
Key-value pairs
For a map, explode generates two columns: key and value
> select id, name , mykey, myvalue from student3 LATERAL VIEW explode(pairs) tmp_view AS mykey, myvalue;
OK
id name mykey myvalue
1 candle k1 v1
1 candle k2 v2
2 jack k1 v1
2 jack k2 v2
3 tom k1 v1
Structs:
Multiple columns
hive (hadoop)> select explode(addr) from student2;
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
Note:
Struct members cannot be exploded directly into multiple rows; explode only accepts array and map.
In fact, turning each struct member into its own row is rarely meaningful.
Structs are usually handled like this instead:
hive (hadoop)> select id,name, addr.city, addr.area, addr.streetid from student2;
OK
id name city area streetid
1 candle shanghai minhang 35
2 jack beijing haidian 100
3 tom hefei shushan 103
The following form is generally not used:
hive (hadoop)> select id, name, info from student2 lateral view explode(array(addr.city, addr.area, addr.streetid)) tmp_view as info;
OK
id name info
1 candle shanghai
1 candle minhang
1 candle 35
2 jack beijing
2 jack haidian
2 jack 100
3 tom hefei
3 tom shushan
3 tom 103
Long to wide: aggregate functions with GROUP BY
collect_list: merges values from multiple rows into one array (list); duplicate members are allowed
select age, collect_list(name), collect_list(id) from student group by age;
age _c1 _c2
18 ["candle","candle"] [1,1]
19 ["aa","jack"] [10,2]
20 ["c2","tom"] [1,4]
100 ["cc"] [12]
200 ["dd"] [13]
collect_set: duplicates are removed
select age, collect_set(name), collect_set(id) from student group by age;
age _c1 _c2
18 ["candle"] [1]
19 ["aa","jack"] [10,2]
20 ["c2","tom"] [1,4]
100 ["cc"] [12]
200 ["dd"] [13]
select map(id, name) from student; generates key-value pairs
select struct(id, name, age) from student; generates a struct
8. Window functions !!!
Ranking within groups
1) Create the table
CREATE TABLE windows
(
id INT,
class STRING,
score INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
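A sketch of loading sample data into the table; the local file path and its contents are hypothetical:

```sql
-- /tmp/windows.txt (comma-separated: id,class,score), e.g.:
-- 1,class1,78
-- 2,class1,78
-- 3,class1,66
LOAD DATA LOCAL INPATH '/tmp/windows.txt' INTO TABLE windows;
```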
2) row_number
partition by specifies the grouping column
order by specifies the ordering column
SELECT class, score, row_number() over(partition by class order by score desc) from windows;
Assigns sequential numbers; tied scores still get distinct, consecutive numbers
class score paiming
class1 78 1
class1 78 2
class1 66 3
class1 66 4
class1 66 5
class1 56 6
class1 56 7
class1 56 8
class1 56 9
class1 56 10
3) rank
Tied scores share the same rank, and the next rank skips the tied positions
SELECT class, score, rank() over(partition by class order by score desc) from windows;
class score _wcol0
class1 78 1
class1 78 1
class1 66 3
class1 66 3
class1 66 3
class1 56 6
class1 56 6
class1 56 6
class1 56 6
class1 56 6
4) dense_rank
Tied scores share the same rank, and the next rank does not skip the tied positions
SELECT class, score, dense_rank() over(partition by class order by score desc) from windows;
class score _wcol0
class1 78 1
class1 78 1
class1 66 2
class1 66 2
class1 66 2
class1 56 3
class1 56 3
class1 56 3
class1 56 3
class1 56 3
5) Per-group top-N
You cannot filter on the window alias directly (WHERE is evaluated before window functions):
hive (hadoop)> SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows WHERE paiming <= 5;
FAILED: SemanticException [Error 10004]: Line 1:110 Invalid table alias or column reference 'paiming': (possible column names are: id, class, score)
Use a subquery instead:
SELECT class, score from
(SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows) t WHERE paiming <=5;