Hive basic operations

1. Functions

　　1. Aggregate functions

　　　　sum(), max(), min(), avg(), count()

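　　2. Math functions

　　　　abs() absolute value

　　　　round() rounds to the nearest value

　　　　ceil() rounds up to the nearest integer

　　　　floor() rounds down to the nearest integer

　　　　power() exponentiation

　　　　10 % 3 ==> 1 remainder (modulo)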

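　　3. String operations

　　　　substr(str, begin, len): extracts a substring; begin is 1-based

　　　　length(str) length of the string

　　　　upper(str) converts to uppercase

　　　　lower(str) converts to lowercase

　　　　initcap(str) capitalizes the first letter of each word

　　　　instr(str, substr) position of the first occurrence of substr (1-based, 0 if not found)

　　　　lpad(str, len, pad) pads on the left up to length len

　　　　rpad(str, len, pad) pads on the right up to length len

　　　　ltrim() removes leading spaces

　　　　rtrim() removes trailing spaces

　　　　trim() removes spaces on both sides

　　　　replace(str, old, new) replaces every occurrence of old with new

　　　　split(str, pattern) splits the string around a regular-expression pattern and returns an array

　　　　concat() concatenates strings

　　　　concat_ws(sep, ...) joins strings, or an array of strings, using the given separator

　　　　coalesce() checks its arguments from left to right and returns the first one that is not NULL; only NULL counts as empty, the empty string '' does not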
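A few of the string functions above on literal values, as a minimal sketch. The literals are made up for illustration, and the query assumes a Hive version that allows select without a from clause (0.13 or later).

```sql
-- Illustrative literals only, no table is read.
select
  substr('hello world', 1, 5)   as first_word,     -- hello (positions start at 1)
  instr('hello world', 'o')     as first_o,        -- 5
  lpad('7', 3, '0')             as padded,         -- 007
  concat_ws('-', 'a', 'b', 'c') as joined,         -- a-b-c
  split('a,b,c', ',')[1]        as second_item,    -- b (array indexes start at 0)
  coalesce(null, '', 'x')       as first_non_null; -- the empty string, because it is not NULL
```

The last column is the case that is easy to get wrong: coalesce stops at '' and never reaches 'x'.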

　　4. Type conversion

　　　　cast(column as type)
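A small sketch of cast on literal values (the literals are assumptions chosen for illustration). It also shows that a conversion that cannot succeed gives NULL instead of failing the query.

```sql
select
  cast('123' as int) + 1 as n,    -- 124
  cast('abc' as int)     as bad;  -- NULL, the string is not a number
```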

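　　5. Array type

　　　　array(v1, v2, ...) builds an array from the given elements

　　　　size(arr) number of elements in the array

　　　　collect_list(column) aggregates the values of a column into an array, keeping duplicates

　　　　collect_set(column) aggregates the values of a column into an array with duplicates removed

　　　　arr[index] returns the element at the given index; indexes start at 0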
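The difference between collect_list and collect_set is easiest to see on a tiny data set. The sketch below builds three hypothetical rows inline with a CTE (t and x are made-up names), so it does not depend on any existing table.

```sql
-- Hypothetical inline data: three rows, the value a appears twice.
with t as (
  select 'a' as x union all
  select 'a' as x union all
  select 'b' as x
)
select
  collect_list(x)      as all_values,      -- three elements, the duplicate is kept
  collect_set(x)       as distinct_values, -- two elements
  size(collect_set(x)) as n_distinct,      -- 2
  array('p', 'q')[0]   as first_elem       -- p
from t;
```

The order of elements inside the collected arrays is not guaranteed.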

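　　6. Map type

　　　　map(k1, v1, k2, v2, ...) builds a map; keys and values must come in pairs

　　　　map_keys() returns all keys of the map as an array

　　　　map_values() returns all values of the map as an array

　　　　size() number of entries in the map

　　　　str_to_map() converts a string to a map; by default "," separates the entries and ":" separates each key from its value (other delimiters can be passed as extra arguments), and every entry must be a key-value pair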
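A minimal sketch of the map functions on a literal string. The input strings and the aliases are assumptions for illustration; the second call passes its own delimiters instead of relying on the defaults.

```sql
with t as (
  select
    str_to_map('a:1,b:2')           as m,   -- default delimiters
    str_to_map('a=1&b=2', '&', '=') as m2   -- custom entry and key-value delimiters
)
select
  m['a']        as value_of_a,  -- 1 (values are strings)
  map_keys(m)   as ks,          -- the keys a and b
  map_values(m) as vs,          -- the values 1 and 2
  size(m)       as n,           -- 2
  m2['b']       as value_of_b   -- 2
from t;
```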

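　　7. Explode functions (split an array into multiple rows)

　　　　1. Exploding a single column

            select explode(split(mn,',')) from mv;

　　　　　　To keep the other columns of the original table:

            select column_name, alias from table_name lateral view explode(col) lv as alias;

　　　　2. Exploding multiple columns

            select column_name, alias1, alias2 from table_name lateral view posexplode(col1) lv1 as pos1, alias1 lateral view posexplode(col2) lv2 as pos2, alias2 where pos1 = pos2;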
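A runnable sketch of the multi-column case. The row, the column names (id, names, scores) and the view aliases are hypothetical. Each posexplode returns a position and a value, and the where clause pairs elements that share a position; without it every name would be combined with every score.

```sql
-- Hypothetical one-row table holding two comma-separated lists.
with mv as (
  select 'n1' as id, 'a,b,c' as names, '1,2,3' as scores
)
select id, name, score
from mv
lateral view posexplode(split(names, ','))  lv1 as pos1, name
lateral view posexplode(split(scores, ',')) lv2 as pos2, score
where pos1 = pos2;
-- three rows: (n1, a, 1), (n1, b, 2), (n1, c, 3)
```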

　　8. Extracting values from a JSON string

　　　　get_json_object(json, path) path is the location of the element inside the JSON string, for example '$.name'
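A minimal sketch of the path syntax, using a made-up JSON document. The path starts at $ for the root, a dot steps into an object, and [n] indexes into an array.

```sql
with t as (
  select '{"name":"SMITH","dept":{"no":20},"tags":["x","y"]}' as js
)
select
  get_json_object(js, '$.name')    as name,     -- SMITH
  get_json_object(js, '$.dept.no') as dept_no,  -- 20
  get_json_object(js, '$.tags[1]') as tag,      -- y
  get_json_object(js, '$.missing') as missing   -- NULL, the path does not exist
from t;
```

The function returns strings, so numeric results usually need a cast before arithmetic.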

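　　9. Date functions

　　　　current_timestamp current date and time

　　　　unix_timestamp() current Unix timestamp, in seconds

　　　　add_months(date, n) shifts the date by n months

　　　　months_between(date1, date2) number of months between the two dates (date1 minus date2)

　　　　last_day(date) last day of the month the date falls in; its day-of-month is the number of days in that month

　　　　next_day(date, day_of_week) date of the next given weekday after the date; the weekday can be written with its first two letters, e.g. 'MO'

　　　　date_add(date, days) shifts the date by a number of days; negative values are allowed

　　　　datediff(d1, d2) number of days from d2 to d1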
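The date functions above on one fixed date, as a sketch. The date 2022-12-27 (a Tuesday) is just an example value.

```sql
select
  add_months('2022-12-27', 1)                as plus_one_month, -- 2023-01-27
  months_between('2022-12-27', '2022-10-27') as months_apart,   -- 2.0
  last_day('2022-12-27')                     as month_end,      -- 2022-12-31
  day(last_day('2022-12-27'))                as days_in_month,  -- 31
  next_day('2022-12-27', 'MO')               as next_monday,    -- 2023-01-02
  date_add('2022-12-27', -7)                 as week_before,    -- 2022-12-20
  datediff('2022-12-27', '2022-12-20')       as days_apart;     -- 7
```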

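　　10. Conditional functions

　　　　nvl(column, default) substitutes the default when the column is NULL

　　　　case ... when

　　　　if(condition, value_if_true, value_if_false) conditional expression; calls can be nested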
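The three constructs side by side, as a sketch against the emp table used elsewhere in these notes. The columns sal and comm and the salary thresholds are assumptions borrowed from the classic Oracle sample schema; they are not defined in this document.

```sql
select
  ename,
  nvl(comm, 0) as comm_filled,                      -- NULL commission becomes 0
  if(sal >= 3000, 'high',
     if(sal >= 1500, 'mid', 'low')) as band_if,     -- nested if
  case
    when sal >= 3000 then 'high'
    when sal >= 1500 then 'mid'
    else 'low'
  end as band_case                                  -- same logic written as case when
from emp;
```

Once there are more than two branches, case when is usually easier to read than nested if calls.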

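　　11. Analytic functions

　　　　1. Ranking functions

　　　　　　row_number() 1234 (always consecutive, ties get different numbers)

　　　　　　rank() 1224 (ties share a rank, followed by a gap)

　　　　　　dense_rank() 1223 (ties share a rank, no gap)

　　　　2. Offset functions

　　　　　　lag() shifts the column down: each row gets the value from a preceding row

　　　　　　lead() shifts the column up: each row gets the value from a following row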
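All five functions need an over clause that defines the ordering. A sketch on the emp table, assuming a sal column (not defined in this document) and ranking salaries within each department:

```sql
select
  ename,
  deptno,
  sal,
  row_number() over (partition by deptno order by sal desc) as rn,       -- 1 2 3 4
  rank()       over (partition by deptno order by sal desc) as rk,       -- 1 2 2 4 when two rows tie
  dense_rank() over (partition by deptno order by sal desc) as drk,      -- 1 2 2 3 when two rows tie
  lag(sal, 1)  over (partition by deptno order by sal desc) as prev_sal, -- value from the previous row, NULL on the first
  lead(sal, 1) over (partition by deptno order by sal desc) as next_sal  -- value from the next row, NULL on the last
from emp;
```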

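　　12. Table joins

　　　　1. Hive joins support only equality conditions (equi-joins), whereas MySQL and Oracle also support non-equi joins (this applies to older Hive releases; Hive 2.2.0 and later accept more complex expressions in the on clause)

　　　　Workaround: move the non-equality condition into the where clause and use it as a filter

　　　　2. Hive subqueries do not support non-equi joins either

　　　　Use in or a semi join to implement the subquery; the joined query result must be given an alias

　　　　Example: select * from emp left semi join (select * from emp where ename='SMITH') a on a.deptno=emp.deptno

　　　　3. For combining result sets Hive offers only union and union all; other set operations are not supported (at least in older releases)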
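A sketch of the two workarounds above. The salgrade table with columns grade, losal and hisal, and the sal column of emp, are assumptions taken from the classic Oracle sample schema.

```sql
-- Non-equi condition moved out of the on clause into where.
-- Note: a cross join may be rejected when Hive runs in strict mode.
select e.ename, e.sal, g.grade
from emp e
cross join salgrade g
where e.sal between g.losal and g.hisal;

-- The semi join example rewritten with in.
select *
from emp
where deptno in (select deptno from emp where ename = 'SMITH');
```

The first query builds every emp/salgrade pair before filtering, so it only suits a small lookup table on one side.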

　　13. with as: syntax shared by Oracle, Hive, SQL Server and MySQL (8.0 and later)

　　　　Example: with t1 as (select statement),

　　　　　　　　 t2 as (select statement)

　　　　　　　　　select statement (a query joining t1 and t2)
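A concrete version of the template, as a sketch on the emp table (the sal column is an assumption). Note the comma between the two named queries and the single with keyword at the start.

```sql
with t1 as (
  select deptno, avg(sal) as avg_sal
  from emp
  group by deptno
),
t2 as (
  select ename, deptno, sal
  from emp
)
-- employees paid above their department average
select t2.ename, t2.deptno, t2.sal, t1.avg_sal
from t2
join t1 on t2.deptno = t1.deptno
where t2.sal > t1.avg_sal;
```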

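　　14. Where Hive syntax differs from Oracle:

　　　　1. Each has its own functions: getting the current time, concatenating strings, and turning rows into columns are all done differently in Hive.

　　　　2. Hive has complex data types (array, map).

　　　　3. The syntax for nested subqueries is different.

　　　　4. Join conditions differ: in Hive, join on accepts only =; >, <, >=, <= and != cannot be used in join on.

　　　　5. In Hive, a result set (subquery) used in a query must be given an alias; leaving it out raises an error.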

　　15. Common file formats for Hive tables

　　　　1. text is Hive's default table storage format; data can be loaded with load

　　　　Example:

                create table emp_text1(

                empno double,

                ename string,

                ... ) row format delimited fields terminated by ','

                stored as textfile;

                -- load the data

                load data local inpath '/test/emp_more.txt' into table bigdata.emp_text1;

　　　　2. sequence (SequenceFile) format; in practice it takes more space than text

                create table emp_sequence(

                empno double,

                ename string,

                ... ) row format delimited fields terminated by ','

                stored as sequencefile;

                -- load the data

                insert overwrite table emp_sequence select * from emp_text1;

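　　　　3. rc format (RCFile): a file format created by Facebook; it stores data by column and loads it lazily, which can speed up queries

                create table emp_rc(

                empno double,

                ename string,

                ... ) row format delimited fields terminated by ','

                stored as rcfile;

                -- load the data

                insert overwrite table emp_rc select * from emp_text1;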

　　　　4. orc format: the most common choice at work; columnar, an optimized version of rc with better compression and storage

                create table emp_orc(

                empno double,

                ename string,

                ... ) row format delimited fields terminated by ','

                stored as orc;

　　　　5. Compression steps

　　　　　　　　1. text compression

                create table emp_text_ys(

                empno double,

                ename string,

                ... ) row format delimited fields terminated by ','

                stored as textfile

                tblproperties("text.compress"="true"); -- marks the table as supporting text compression

                -- before inserting data, turn compression on for the session

                set hive.exec.compress.output=true;

                set mapred.output.compress=true;

                -- on Linux, compress the file with gzip: gzip /test/emp_more.txt

                -- load the data

                load data local inpath '/test/emp_more.txt.gz' into table bigdata.emp_text_ys;

　　　　　　　　2. sequence supports three compression types: NONE, RECORD and BLOCK; BLOCK is the usual choice

                -- turn compression on

                set hive.exec.compress.output=true;

                set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

                set mapred.output.compression.type=BLOCK;

                set mapred.output.compress=true;

                set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

                -- load the data

                insert overwrite table emp_sequence_ys select * from emp_text_ys;

　　　　　　　　3. rc

                set hive.exec.compress.output=true;

                set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

                set mapred.output.compress=true;

                set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

　　　　　　　　4. orc: zlib and snappy compression

                -- only needs to be specified when the table is created

                tblproperties("orc.compress"="ZLIB");
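Put together, an ORC table with compression can be sketched as follows. The table name emp_orc_ys is hypothetical (it follows the _ys naming used above), only the two columns shown earlier are listed, and SNAPPY is picked just to show the second codec.

```sql
-- Hypothetical table; the codec is fixed at creation time via tblproperties.
create table emp_orc_ys (
  empno double,
  ename string
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");

-- ORC files are written by Hive itself, so the data comes from a query
-- rather than from loading a raw text file.
insert overwrite table emp_orc_ys
select empno, ename from emp_text1;
```

Unlike the text, sequence and rc cases, no session-level set statements are needed here.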

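　　　　　　　　5. In day-to-day work only two storage formats are used:

　　　　　　　　text: if the table's files are imported and exported often and the data is not very large, use the text format with gzip compression.

　　　　　　　　orc: when the table is very large and reading it in other formats is slow, use the orc format.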

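　　　　　　　　6. To run delete and update in Hive, at least three conditions must be met:

　　　　　　　　1. the table must be a bucketed table

　　　　　　　　2. the table must be stored as orc

　　　　　　　　3. transactions must be enabled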
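The three conditions can be sketched as one table definition. The table emp_acid, its bucket count and the sample rows are hypothetical, and the two set statements are only the client-side part: the metastore needs its own transaction settings, older releases may need further options, and newer Hive releases may relax the bucketing requirement.

```sql
-- Session settings commonly required for transactional tables.
set hive.support.concurrency = true;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical table meeting all three conditions:
-- bucketed, stored as orc, and marked transactional.
create table emp_acid (
  empno int,
  ename string
)
clustered by (empno) into 4 buckets
stored as orc
tblproperties ('transactional' = 'true');

insert into table emp_acid values (1, 'SMITH'), (2, 'ALLEN');

-- The bucketing column (empno) cannot be updated, so only ename is changed.
update emp_acid set ename = 'SMYTHE' where empno = 1;
delete from emp_acid where empno = 2;
```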

posted on 2022-12-27 16:18 银光短战棍
