hive:
================
A data warehouse tool for analyzing massive data sets
Runs on top of hadoop
hive generally does not support transactions or row-level updates, though transactional support can be enabled when necessary
Typical use cases for data warehouses vs. relational databases:
data warehouse: OLAP //analytics
relational database: OLTP //transactions
Array: select array[0] from t;
Map: select map['height'] from t;
Struct: select struct.DB from t;
Union: select union['height'][0] from t;
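A minimal sketch tying the complex types together (the table and column names below are hypothetical):
create table complex_t(
    arr array<string>,
    m map<string,int>,
    s struct<DB:string, ver:int>
) row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
select arr[0], m['height'], s.DB from complex_t;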
hive execution flow:
=======================
The user enters an HQL (HiveQL) statement at the interface; the compiler translates it into an MR job, which the execution engine submits to hadoop
hive1 //MR is the recommended execution engine
hive2 //spark is the recommended execution engine
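The engine can also be switched per session (a minimal sketch; tez or spark must be installed and wired into hive separately):
set hive.execution.engine=mr; //or tez / spark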
hive's data layout:
==========================
metadata: mysql
real data: HDFS
a table: a directory
a database: a directory
data: HDFS files, plain text by default
hive's default field delimiter: \001 (Ctrl-A)
============================
create table t1(id int, name string) row format delimited
fields terminated by '\t'
lines terminated by '\n'
collection items terminated by ','
map keys terminated by ':'
stored as textfile;
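A matching data file separates the two fields with a tab; hypothetical content and path:
1	tom
2	jerry
load data local inpath '/home/centos/t1.txt' into table t1;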
Analysis of the hive launcher script:
================================
hive
--version
--service
--rcfilecat
--orcfiledump
--llapdump
--help
--debug
Help for a second-level command:
hive --service serviceName --help
hive services: hive --service
====================================
beeline //second-generation user interface
cli //first-generation user interface
help
hiveserver2 //second-generation hive service
version
beeline and hiveserver2
===================================
hiveserver2 //hive's JDBC service
hive --service hiveserver2 //start hiveserver2
hiveserver2 //also starts hiveserver2
s101:10002 //hiveserver2's webUI
s101:10000 //hiveserver2's RPC address
beeline: the new hive2 client
When the error "User: centos is not allowed to impersonate anonymous (state=08S01,code=0)" appears:
===========================================================================================
Solution: add the following configuration to core-site.xml
<property>
    <name>hadoop.proxyuser.centos.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.centos.groups</name>
    <value>*</value>
</property>
After adding this, distribute the file to the other machines and restart hadoop
Problem 2: after the config change, create statements fail //the anonymous user has no write permission
1. First option: edit the hive-site.xml config file and change the user to centos
<property>
    <name>hive.server2.thrift.client.user</name>
    <value>centos</value>
    <description>Username to use against thrift client</description>
</property>
2. Second option: specify the username in the connection string //-u is the url, -n is the name
~]$ beeline -u jdbc:hive2://localhost:10000 -n centos
Connecting to hive's JDBC service:
=======================================
1. Start hiveserver2
hive --service hiveserver2 //start hiveserver2
hiveserver2 //also starts hiveserver2
2. Start a client (JDBC from IDEA \ beeline)
Code:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.1.1</version>
</dependency>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String url = "jdbc:hive2://192.168.23.101:10000/myhive";
        //register the driver
        Class.forName(driver);
        //get a connection from DriverManager
        Connection conn = DriverManager.getConnection(url);
        //create a statement from the connection
        Statement st = conn.createStatement();
        //run CRUD through the statement;
        //queries use executeQuery, which returns a result set
        ResultSet rs = st.executeQuery("select * from tt");
        while (rs.next()) {
            int id = rs.getInt(1);
            String name = rs.getString(2);
            int age = rs.getInt(3);
            System.out.println(id + "/" + name + "/" + age);
        }
        rs.close();
        st.close();
        conn.close();
    }
}
beeline:
1. Type beeline on the linux command line
~]$ beeline
2. Connect from the beeline prompt
beeline> !connect jdbc:hive2://localhost:10000
Managed (internal) tables and external tables:
==========================================
1. By default a managed (internal) table is created; dropping the table also deletes its data
create table xxx
2. External table: dropping the table deletes only the metadata, not the data //effectively the table's info is just removed from mysql
create external table xxx
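A minimal external-table sketch (the table name and HDFS location below are hypothetical):
create external table ext_users(id int, name string)
row format delimited fields terminated by '\t'
location '/data/ext_users';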
hive functions:
==================================
desc function [extended] substr; //show a function's usage [extended details]
select current_database(); //show the current database
select current_date(); //show the current date
select current_timestamp(); //show the full timestamp
select explode(map or array) from xxx; //explode a map or array, turning one row into many
select substr(str,index,len) //substr('hello',2,3) ====> ell
//select substr('hello',-4); ====> ello
//select substr('hello',-4,3); ====> ell
select split('hello world',' ') //returns an array
1. Regular (scalar) functions
2. Table-generating functions
3. Aggregate functions
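One sketch per category (the table t and its columns are hypothetical):
select upper(name) from t; //regular function: one value in, one value out
select explode(hobbies) from t; //table-generating function: one row in, many rows out
select age, count(*) from t group by age; //aggregate function: many rows in, one value per group out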
Wordcount with hive's explode function:
=============================================
1. Column: line =====> array type ====> collection items terminated by ' '
create table t5(line ARRAY<string>) row format delimited
fields terminated by '\t'
collection items terminated by ' '
lines terminated by '\n'
stored as textfile;
explode: select explode(line) as word from t5;
group: select word, count(1) from (select explode(line) as word from t5) t6 group by word;
group + sort: select word, count(1) as c from (select explode(line) as word from t5) t6 group by word order by c desc;
2. Column: line =====> string type ====> split(line,' ')
create table t6(line string) row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
split: select split(line,' ') from t6;
explode: select explode(split(line,' ')) as word from t6;
group: select word, count(1) from (select explode(split(line,' ')) as word from t6) t7 group by word;
group + sort: select word, count(1) as c from (select explode(split(line,' ')) as word from t6) t7 group by word order by c desc;
!!! Note: a select is not allowed inside a function call; nested functions are written directly, as in explode(split(line,' '))
hive's .hiverc file:
===================================
Holds the user's hive settings; it lives in the home directory (~), and every time hive starts it automatically finds and loads this file
Sample content:
set hive.cli.print.header=true;
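A slightly fuller .hiverc sketch (the second line is a common addition, not from these notes):
set hive.cli.print.header=true;     //print column headers in query results
set hive.cli.print.current.db=true; //show the current database in the cli prompt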
Max-temperature statistics with hive:
===================================
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ //NCDC weather bureau data
Substring out the year and the temperature ===> cast the temperature to int ====> group and aggregate (max)
1. Create the table
create table temp(line string) row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
2. Load the data
load data local inpath 'Temp/*' into table temp;
3. Write the sql
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
cast(col as int) //cast(line as int) converts the line column to int values
get the year: substr(line,16,4)
get the temperature: cast(substr(line,88,5) as int) != 9999
year and temperature: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp;
filter out invalid readings: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999;
select a.year, max(a.temp) from
(select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999) as a
group by a.year;
Table operations:
=========================================
like
create table ep like employee; //copy only the table structure
as
create table ep2 as select * from employee; //copy the table with its data
create table ep3 as select ep2.name, ep2.sex_age from ep2; //copy the table, keeping selected columns
truncate
truncate table ep3; //truncate (empty) the table
column
add columns
alter table ep3 add columns(id int, age int);
drop a column
alter table ep3 drop column(id); //fails: hive does not support dropping a single column
change a column
alter table ep3 change id no string; //rename column id to no and set its type to string
replace columns
alter table ep3 replace columns(name string, id int, age int); //replace the entire column list
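Since drop column is unsupported, the usual workaround is to list only the surviving columns with replace columns; a sketch, assuming ep3 currently has (name, id, age):
alter table ep3 replace columns(name string, age int); //id disappears from the schema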
Partitioned table: each partition is a directory
select ... where ...; //partition columns serve as filter conditions in the where clause
1,tom,20,henan,kaifeng
Create a partitioned table:
create table users(id int, name string, age int) partitioned by (province string, city string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Statically add partitions to the partitioned table:
alter table users add partition(province='henan', city='luoyang') ;
alter table users add
partition(province='beijing', city='beijing')
partition(province='hebei', city='xiongan')
partition(province='hebei', city='shijiazhuang')
;
Drop a partition: [IF EXISTS] drops it if present and does not error if it is absent
ALTER TABLE users DROP [IF EXISTS] PARTITION (province='beijing', city='beijing') ;
Show the partitions of a partitioned table:
show partitions users;
load data into a partitioned table:
load data local inpath '' overwrite into table users partition(province='beijing', city='beijing');
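For example, with a hypothetical local file:
load data local inpath '/home/centos/users_beijing.txt' overwrite into table users partition(province='beijing', city='beijing');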
insert:
===========================================
Dynamically inserting into partitions with the insert command:
1. Turn off dynamic-partition strict mode:
Reason: strict mode guards against performance problems from too many partitions by requiring at least one static partition
turn strict mode off:
set hive.exec.dynamic.partition.mode=nonstrict;
non-partitioned table: id,name,age,province,city
partitioned table: id,name,age partitions: province,city
2. Dynamically insert the data; by default the partition values are taken from the last n columns of the select, so the partition columns must come last
insert into table users partition(province, city)
select user0.id, user0.name, user0.age, user0.province, user0.city from user0;
Exporting data with the insert command: note that the export always overwrites; the overwrite keyword cannot be dropped
insert overwrite local directory '/home/centos/users' row format delimited
fields terminated by '\t'
select * from users;
Insert data into multiple directories at once
from users
insert overwrite local directory '/home/centos/users'
row format delimited fields terminated by '\t'
select * from users;
insert overwrite local directory '/home/centos/user2'
row format delimited fields terminated by '\t'
select * from users;
Insert one table's data into another table with the insert command:
insert into t1 select t4.name from t4;
order by: total ordering
uses a single reducer
1. Manually set the number of reducers in hive:
set mapreduce.job.reduces=2;
2. A group by query
select province, count(*) from users group by province;
3. Add order by
select province, count(*) as c from users group by province order by c;
sort by: partial ordering (sorted within each reducer)
1. Manually set the number of reducers in hive:
set mapreduce.job.reduces=2;
2. group by + sort by
select province, count(*) as c from users group by province sort by c;
tianjin 1
hebei 2
beijing 9
henan 2
distribute by: partitions the data by the specified column
select province, age from users distribute by province sort by age;
select province, count(*) as c from users group by province distribute by province sort by c;
Difference between distribute by and group by:
1. group by is used together with aggregate functions, and which partition a row lands in is undefined
2. distribute by needs no aggregate function, and rows are partitioned by the specified column
cluster by: equivalent to distribute by + sort by, i.e. sorting within each partition, but the partition column and the sort column must be the same
select province, count(*) as c from users group by province cluster by province;
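A sketch of the cluster by equivalence (same users table as above):
select * from users distribute by province sort by province; //identical to:
select * from users cluster by province;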
How Hive clauses map onto the MR pipeline:
==========================================
select   -> map
from     -> input
where    -> map
having   -> reduce
group by -> partition
order by -> a second job
limit    -> cleanup of the second job
limit: paged queries
==========================
Caps the number of rows in the output
select * from users limit 10 //first ten rows
select * from users limit 10,10 //ten rows starting after row 10
join:
==============================
1. Inner join
select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a, orders b where a.id=b.uid;
select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a inner join orders b on a.id=b.uid;
2. Left outer join
select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a left outer join orders b on a.id=b.uid;
3. Right outer join
select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a right outer join orders b on a.id=b.uid;
4. Full outer join //not supported by mysql
select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a full outer join orders b on a.id=b.uid;
the user0 table:
100 yyy 100 beijing beijing
7 tt 7 tianjin tianjin
1 tom 20 hebei langfang
2 tomas 30 henan xinyang
3 tomson 45 hebei shijiazhuang
4 tomason 6 beijing beijing
5 tomasLee 7 henan kaifeng
the orders table:
1 order001 20 100
2 order002 30 5
3 order003 90 7
4 order004 30.6 6
5 order005 4.2 1
6 order006 5 2
Manually designating the big/small table; the reduce-side join implementation
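The notes break off here; as a sketch, the big (streamed) table of a reduce-side common join can be designated by hand with the STREAMTABLE hint (aliases as in the join examples above):
select /*+ STREAMTABLE(a) */ a.id, a.name, b.orderno, b.price
from user0 a join orders b on a.id=b.uid;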