Hive Day03

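Differences between managed (internal) tables and external tables

Conceptually

A managed table owns its data: Hive manages the data itself and has the right to delete it.
Dropping a managed table deletes both the data and the metadata.

An external table only associates a table definition with a directory on HDFS. Hive may use the data but has no right to delete it. Dropping an external table removes only the metadata; the underlying data (the rows in the table) is not deleted.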

How to completely delete an external table's data

drop table t_name;
# then go to the table's data directory on HDFS and run
hadoop fs -rm -r path

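Use cases

External tables are generally used to store raw data and shared (public) data.
Managed tables are generally used to store intermediate results. Results derived from the raw data that only the current department uses can simply be stored as managed tables.

Storage directory

When creating an external table, you usually have to point the table's data directory at the shared directory by hand, using the location keyword.
Managed tables have no strict requirement and usually use the default directory.

Shared path for the external tables:
/source/log
Department 1 creates its table: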

create external table log1(course string,name string,score int)row format delimited fields terminated by "," location '/source/log';

Department 2 creates its table:

create external table log2(course string,name string,score int)row format delimited fields terminated by "," location '/source/log';

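Other DML operations

insert into vs. insert overwrite

insert into appends rows to the table.
insert overwrite replaces the contents: the data already in the table is cleared before the new rows are written.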
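
A minimal sketch of the difference. It assumes a hypothetical managed table score_copy (not part of the original notes) that is filled from the log1 table defined earlier:

```sql
-- score_copy is a hypothetical table used only for illustration
create table score_copy(course string, name string, score int);

-- first load: score_copy holds one copy of the rows in log1
insert into table score_copy select course, name, score from log1;

-- insert into appends: score_copy now holds two copies of every row
insert into table score_copy select course, name, score from log1;

-- insert overwrite clears the old contents first: back to one copy
insert overwrite table score_copy select course, name, score from log1;
```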

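Data export

Single export

Export the data in a table to files:

insert overwrite [local] directory directory1 select_statement

overwrite: overwrites whatever is already in the target directory
local: with local the export goes to the local filesystem; without it, to HDFS
directory: the local or HDFS path to write to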

insert overwrite local directory '/home/hadoop/tmpdata/test_hive' select * from student_buk where grade=1303;

Multiple export

from from_statement
insert overwrite [local] directory '' select ... where ...
insert overwrite [local] directory '' select ... where ...
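
A sketch of the template above filled in for the student_buk table. The second grade value (1304) and both output directories are assumptions chosen for illustration:

```sql
-- the table is read once and each insert branch writes its own directory
from student_buk
insert overwrite local directory '/home/hadoop/tmpdata/grade_1303'
  select * where grade=1303
insert overwrite local directory '/home/hadoop/tmpdata/grade_1304'
  select * where grade=1304;
```

Because of overwrite, anything already in either directory is replaced.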

Data querying

Syntax:
Clauses: join, where, group by, having, order by, limit

select ... from ... join ... where ... group by ... having ... order by ... limit

Logical execution order (as in MySQL):
from -> join -> where -> group by -> having -> select -> order by -> limit
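
To make the clause layout concrete, here is a sketch of one query that uses most of the clauses. It reuses student_buk with its grade and stu_id columns; the filter and the threshold of 10 are arbitrary values picked for illustration:

```sql
select grade, count(*) as cnt      -- columns to return
from student_buk                   -- source table
where stu_id is not null           -- row filter, applied before grouping
group by grade                     -- one group per grade
having count(*) > 10               -- group filter, applied after grouping
order by cnt desc                  -- sort the final rows
limit 5;                           -- keep only the first 5 rows
```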

join

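Only equi-joins are supported. Conditions in the on clause can be combined with and, but not with or.
More than two tables can be joined.

Table a: id, name
Table b: id, age

Inner join

[inner] join
Joins the two tables on their common field.
Only rows whose join key appears in both tables are returned.

Outer joins

Left outer join

The left table is the primary table: every left row is kept. Where the right table has a matching row it is joined; otherwise the right-hand columns are filled with null.

select * from a left outer join b on a.id=b.id;

Right outer join

The right table is the primary table: every right row is kept. Where the left table has a matching row it is joined; otherwise the left-hand columns are filled with null.

select * from a right outer join b on a.id=b.id;

Full outer join

full outer join
Returns the union of the left and right tables.

All rows from both tables are kept: matching rows are joined, and rows without a match are padded with null.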
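
A worked sketch of the four join types on the tables a(id, name) and b(id, age). The sample rows are assumptions made up for illustration, and the order of the result rows is not guaranteed:

```sql
-- assumed contents:
--   a: (1, 'zs'), (2, 'ls')
--   b: (2, 20),   (3, 19)

select * from a join b on a.id=b.id;
-- 2  ls  2  20

select * from a left outer join b on a.id=b.id;
-- 1  zs  NULL  NULL
-- 2  ls  2     20

select * from a right outer join b on a.id=b.id;
-- 2     ls    2  20
-- NULL  NULL  3  19

select * from a full outer join b on a.id=b.id;
-- 1     zs    NULL  NULL
-- 2     ls    2     20
-- NULL  NULL  3     19
```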

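Semi join

left semi join

MySQL has the in/exists syntax for testing whether a field's value is among a given set of values.
To execute an HQL statement, Hive has to convert it into a MapReduce (MR) job.
in/exists performs very poorly once converted into an MR job, so the usual solution is to use a join instead.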

select * from a left semi join b on a.id=b.id;

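A left semi join checks whether the left table's join key exists in the right table.
If it exists, the left table's columns are returned for that row.
If it does not exist, the row is not returned.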
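
For comparison, a sketch of the same question ("which rows of a have an id that also occurs in b?") written first with in and then as a left semi join, using the tables a and b from above. Whether the in form is accepted at all depends on the Hive version, which is an assumption to check for your installation:

```sql
-- in-subquery form
select a.id, a.name from a where a.id in (select id from b);

-- left semi join form; only columns of the left table a may be selected
select a.id, a.name from a left semi join b on a.id=b.id;
```

Unlike an inner join, the semi join returns each row of a at most once, even if b contains several rows with the same id.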

where

Multiple filter conditions can be combined with and / or.

group by

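Think carefully before grouping.

group by is executed before select, so aliases defined in select cannot be used in group by.
When a query contains group by, the select list may contain only two kinds of expressions:
1. aggregate functions
2. the group by fields

select grade, count(*) from student_buk group by grade;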
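
A sketch of what the two rules mean in practice, using student_buk and its grade and stu_id columns. The two commented-out statements are expected to be rejected, and the aliases are hypothetical names:

```sql
-- rejected: stu_id is neither aggregated nor listed in group by
-- select grade, stu_id, count(*) from student_buk group by grade;

-- rejected per the rule above: the select alias g is not visible to group by
-- select grade as g, count(*) from student_buk group by g;

-- accepted: every selected expression is a group by field or an aggregate
select grade, count(*) as cnt, max(stu_id) as max_id
from student_buk
group by grade;
```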

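Sorting in Hive

order by

order by sort_field asc|desc
Global sort.
The complete result is sorted as a whole; to guarantee a total order, all rows go through a single reduce task.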

select * from student_buk order by stu_id desc limit 20;

sort by

Local sort: the output of each reduce task is sorted separately.

select * from student_test sort by age desc;

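When the rows are partitioned across reduce tasks, a field is picked at random each time:
hash(randomly chosen field) % number of reduce tasks

With only one reduce task, sort by is equivalent to order by.

distribute by (bucketing)

No sorting is performed.

Number of buckets = number of reduce tasks
distribute by is followed by the bucketing field
The query result is split into buckets.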

select * from student_test distribute by age;

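How rows are assigned to a bucket:

String bucketing field: hash(field) % number of reduce tasks
Numeric bucketing field: field % number of reduce tasks
To bucket the query result by one field and sort within each bucket, use distribute by <bucketing field> + sort by <sort field>:

select * from student_test distribute by age sort by age desc;

The distribute by and sort by fields can be different.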
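
One way to see the buckets is to fix the number of reduce tasks and export the result, since each reduce task writes its own output file. This is a sketch: the reducer count of 3 and the output directory are assumptions picked for illustration.

```sql
-- use 3 reduce tasks, so the result is split into 3 buckets
set mapreduce.job.reduces=3;

-- rows with the same age end up in the same file; each file is sorted by age descending
insert overwrite local directory '/home/hadoop/tmpdata/age_buckets'
select * from student_test distribute by age sort by age desc;
```

Each output file is sorted on its own, but concatenating the files does not give a globally sorted result; that still requires order by.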

cluster by

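Equivalent to distribute by + sort by

When the distribute by and sort by fields are the same, cluster by can be used instead.
Rows are first bucketed by the given field and then sorted by that same field.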

select * from student_test cluster by age;

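Note: when the distribute by and sort by fields differ, cluster by cannot be used as a replacement.

The sort is ascending by default, and the order cannot be specified.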

union

UNION [ALL | DISTINCT]

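all: duplicates are kept

distinct: duplicates are removed

UNION combines the results of multiple SELECT statements into a single result set.

The only difference between the two variants is whether duplicate rows are removed.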
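
The external tables log1 and log2 defined earlier both read the same directory /source/log, so they make a convenient illustration. A sketch, assuming a Hive version that accepts the distinct keyword after union:

```sql
-- union all keeps duplicates: every row comes back once from log1 and once from log2
select course, name, score from log1
union all
select course, name, score from log2;

-- union distinct removes duplicates: each distinct row is returned once
select course, name, score from log1
union distinct
select course, name, score from log2;
```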

Runtime parameter configuration

# Throughput of a single reduce task
In order to change the average load for a reducer (in bytes):
	set hive.exec.reducers.bytes.per.reducer=<number>   (default: 256M)
# Upper limit on the total number of reduce tasks launched
In order to limit the maximum number of reducers:
	set hive.exec.reducers.max=<number>   (default: 1009)
# Number of reduce tasks
In order to set a constant number of reducers:
	set mapreduce.job.reduces=<number>   (default: -1)
# -1 means the number is chosen automatically. For a bucketed table it is set to the number of buckets;
# for a non-bucketed table it is 0 when there is no reduce phase and 1 when there is one.

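When is a Hive query converted into an MR job?

hive.fetch.task.conversion controls whether Hive converts a query into an MR job.

none (0): fetch conversion is disabled; every HQL statement is converted into an MR job.
minimal (1): select *, filters on partition columns, and limit are not converted into MR jobs; everything else is.
more (2): select of any columns, filters on any columns, and limit are not converted into MR jobs; everything else is.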
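
A sketch of how the setting plays out for student_buk. Which plan Hive actually picks can be checked by prefixing a query with explain; the column names are the ones used elsewhere in these notes:

```sql
set hive.fetch.task.conversion=more;

-- plain column selection + filter + limit: answered directly, no MR job
select stu_id, grade from student_buk where grade=1303 limit 10;

-- aggregation still needs an MR job under any setting
select grade, count(*) from student_buk group by grade;

set hive.fetch.task.conversion=none;

-- with none, even this simple query is converted into an MR job
select * from student_buk limit 10;
```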

Hive data types

Primitive data types (compare with Java)

tinyint
smallint
int
bigint
boolean
float
double
string
timestamp

Complex data types

array

Array type, similar to ArrayList

Raw data

id	names
1	zs,xsz,gs
2	ls,xlz,yl,dg
3	ww,xw

id   int
names array

Create the table

create table test_array(id int,names array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

collection items terminated by: specifies the delimiter between the elements of a collection

Load the data:

load data local inpath '/home/hadoop/tmpdata/test_array'
into table test_array;

Access: by index, starting from 0

select id,names[2] from test_array;
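
A few more ways to work with the array column, sketched with the built-in functions size, array_contains and explode. The result comments assume the data file holds exactly the three data rows shown above, without the header line:

```sql
-- an index past the end of the array yields NULL instead of an error
select id, names[2] from test_array;
-- 1  gs
-- 2  yl
-- 3  NULL

-- number of elements, and a membership test
select id, size(names), array_contains(names, 'xw') from test_array;
-- 1  3  false
-- 2  4  false
-- 3  2  true

-- one output row per array element, keeping the id alongside
select id, name from test_array lateral view explode(names) t as name;
```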

map

A mapping: key-value type

Raw data

id	family
1	dad:zs,mum:hanmeimei
2	sister:lily,brother:john,mum:Alice

id	int
family	map<string,string>

Create the table

create table test_map(id int,family map<string,string>)
row format delimited fields terminated by "\t"
collection items terminated by ","
map keys terminated by ":";

map keys terminated by: the delimiter between a key and its value in the map

Note: delimiters are specified from the outside in, that is, from the largest unit to the smallest.

Load the data:

load data local inpath '/home/hadoop/tmpdata/test_map'
into table test_map;

Access: look up the value by key

select id,family["mum"] from test_map;
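
A short sketch of further map access with the built-in functions size, map_keys and map_values. The result comments assume the file contains only the two data rows shown above:

```sql
-- a key that is missing from the map yields NULL
select id, family["dad"] from test_map;
-- 1  zs
-- 2  NULL

-- number of entries in each map
select id, size(family) from test_map;
-- 1  2
-- 2  3

-- all keys and all values as arrays
select id, map_keys(family), map_values(family) from test_map;
```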

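struct

A structure, similar to an object (an instance of a class) in Java.

A struct stores a group of values that share the same structure.
Same structure: the same number of fields, with each field having the same meaning in every row.

Raw data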

id	stuinfo
1	zs,23,xian
2	ls,20,wuhan
3	ww,19,sichaun

id  int
stuinfo	struct

class Stu {
    String name;
    int age;
    String jiguan;   // native place
}

Create the table

create table test_struct(id int,stuinfo struct<name:string,age:int,jiguan:string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

Load the data

load data local inpath '/home/hadoop/tmpdata/test_struct'
into table test_struct;

Access: like an object, with object.field

select id,stuinfo.name from test_struct;
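
Struct fields can be used anywhere an ordinary column can, including filters. A sketch, with result comments that assume the table was created and loaded successfully with the three data rows shown above:

```sql
-- filter on one struct field, return two others
select id, stuinfo.name, stuinfo.jiguan
from test_struct
where stuinfo.age > 19;
-- 1  zs  xian
-- 2  ls  wuhan
```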

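Hive views

Characteristics

Only logical views exist; there are no materialized views.
Views do not support insert, update, or delete; they are query-only.
A view is essentially a shortcut for an HQL statement.
The view's query is actually executed only when the view is queried.
What is stored for a view, in the metastore database, is only the SQL statement that the view represents.

Operations

Create a view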

create view view_name as select ...
create view age_19 as select * from student_test where age>19;

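List views

show tables;  # lists all tables and views in the current database
show views;   # lists only the views in the current database

Show a view's details

desc view_name;
desc formatted view_name;

Query a view

Treat the view like an ordinary table:

select * from view_name;

Drop a view

drop table view_name; # does not work; drop table only drops tables
drop view view_name;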

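Hive functions

MySQL also has functions, such as sum, avg, max, min, and count.
They make data processing and statistical analysis easier.

Hive functions fall into three categories:
UDF: user-defined function
UDAF: user-defined aggregate function
UDTF: user-defined table-generating function

UDF

Processes one row and produces one row (one in, one out).
Example: string functions

UDAF

Loads multiple rows at a time and produces a single row.
Many in, one out.

UDTF

Loads one row at a time and produces multiple rows (one in, many out).
Example: explode(array|map)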
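
A sketch with one built-in function of each kind, run against the test_struct and test_array tables from the data type section (assuming they were created and loaded as shown there):

```sql
-- UDF behavior: one row in, one row out
select id, upper(stuinfo.name) from test_struct;

-- UDAF behavior: many rows in, one row out
select count(*), avg(stuinfo.age) from test_struct;

-- UDTF behavior: one row in, many rows out (one per array element)
select explode(names) as name from test_array;
```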

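Hive built-in functions

Functions that ship with Hive

show functions; # lists all built-in functions
desc function f_name; # shows a function's description
desc function extended f_name; # shows a function's detailed description

Built-in function examples
floor: rounds down
ceil: rounds up
round(x[,d]): rounds to a given number of decimal places. x is the floating-point number to process and d is the number of decimal places to keep; when d is omitted, x is rounded to the nearest integer.
abs: absolute value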
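
A quick sketch of these four functions on literal values. It assumes a Hive version that allows select without a from clause; the exact formatting of the results (for example 4 versus 4.0) may vary between versions:

```sql
select floor(3.7),          -- 3
       ceil(3.2),           -- 4
       round(3.14159, 2),   -- 3.14
       round(3.5),          -- 4
       abs(-5);             -- 5
```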

Hive user-defined functions

Posted 2019-05-15 17:03 by 耳_东
