[Getting Started with Hive] - Practice

Notes I wrote during an earlier internship, uploaded here as a backup.

1. Quickly setting up a Hive cluster with docker-compose

Use Docker to quickly configure a Hive environment.

Pull the image.

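2. Hive data types

Implicit conversion: a narrower type can be converted to a wider one automatically (for example, int to bigint).
Explicit conversion: use cast.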
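
A minimal sketch of both kinds of conversion, assuming a Hive version that allows select without a from clause. The literal values are made up for illustration.

```sql
-- Implicit: the int operand is widened to bigint before the addition
select cast(1 as int) + cast(2 as bigint);

-- Explicit: convert a string to int
select cast('100' as int);

-- A cast that cannot succeed yields NULL rather than an error
select cast('abc' as int);
```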

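3. Reading and writing files in Hive

SerDe: Serializer (object to bytes) and Deserializer (bytes to object).

3.1 How Hive reads and writes files

Reading is deserialization (mapping a file to a table): Hive calls the InputFormat to turn the file into <key, value> records, then the SerDe deserializes each record into a row.
Writing goes the other way: each row is serialized, then handed to the OutputFormat to be written out as a file.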

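3.2 SerDe syntax

row format specifies the serialization method and the delimiters.
- Delimited: the default serialization method
- JSON: a different serialization method, selected with row format serde
The default Hive field delimiter is "\001".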
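
As a sketch of the second option, the table below reads one JSON object per line. The table name and columns are hypothetical, and the SerDe class shown ships in the hive-hcatalog-core jar, which is assumed to be on the classpath.

```sql
-- Hypothetical table: each line of the file is one JSON object,
-- for example {"id": 1, "name": "a"}
create table demo_json_hero(
    id int,
    name string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
```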

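4. Storage path

Default storage: /usr/hive/warehouse in this environment (set by hive.metastore.warehouse.dir; a stock Hive install defaults to /user/hive/warehouse)
Specifying a storage path: location hdfs_path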
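
To check where a table actually lives, something like the following can help. It assumes the hero_info_1 table from the exercise in section 5 already exists.

```sql
-- Print the configured warehouse root
set hive.metastore.warehouse.dir;

-- The Location field in the output shows the table's directory
describe formatted hero_info_1;
```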

5. Exercises

Create tables and load data.

use ods;
create external table hero_info_1(
    id bigint comment "ID",
    name string comment "hero name",
    hp_max bigint comment "max HP"
)
comment "Honor of Kings hero info"
row format delimited
fields terminated by "\t";

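Upload the file to the table's directory; as long as the table declares the right delimiter, Hive can read it. The path below assumes the table lives in the test database, so adjust it if the table was created elsewhere (the statement above uses ods).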

hadoop fs -put test1.txt /usr/hive/warehouse/test.db/hero_info_1

Map type

create table hero_info_2(
    id int comment "ID",
    name string comment "hero name",
    win_rate int comment "win rate",
    -- note the delimiters used for the map column
    skin map<string, int> comment "skin: price"
)
comment "hero skin table"
row format delimited
-- delimiter between fields
fields terminated by ","
-- delimiter between collection elements
collection items terminated by '-'
-- delimiter between the key and value of a map entry
map keys terminated by ':';

hadoop fs -put test2.txt /usr/hive/warehouse/test.db/hero_info_2
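
A sketch of what a matching data line and a map lookup might look like. The sample row and the skin names are invented for illustration; only the delimiters come from the table definition above.

```sql
-- One line of a hypothetical test2.txt:
--   1,Sun Wukong,53,Fire Eyes:288-Monkey King:888
-- Fields are split on ",", map entries on "-", and each key from its value on ":".

-- Look up one key; a missing key returns NULL
select name, skin['Fire Eyes'] as price
from hero_info_2;

-- Inspect the map as a whole
select name, map_keys(skin), map_values(skin), size(skin)
from hero_info_2;
```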

6. Using a specified path

create table t_hero_info_3(
    id int comment "ID",
    name string comment "hero name",
    win_rate int comment "win rate",
    -- note the delimiters used for the map column
    skin map<string, int> comment "skin: price"
)
comment "hero skin table"
location "/tmp";

select *
from t_hero_info_3;

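7. Managed (internal) and external tables

Dropping an external table removes only its metadata; the HDFS files are kept.
Dropping a managed (internal) table deletes its HDFS files as well.
External tables are the usual choice.

drop table t_hero_info_3;
-- t_hero_info_3 was created without the external keyword, so its files are deleted too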
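
The table type can be checked, and switched, without recreating the table. This is only a sketch; it assumes t_hero_info_3 still exists, so it would have to run before the drop above.

```sql
-- The "Table Type" field shows MANAGED_TABLE or EXTERNAL_TABLE
describe formatted t_hero_info_3;

-- Convert a managed table to an external one so that drop keeps the files
alter table t_hero_info_3 set tblproperties('EXTERNAL'='TRUE');
```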

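9. Partitioned tables

After several files are uploaded into one table, queries become slow, because a where filter has to scan the whole table.
The data here is naturally grouped by hero role (marksman, for example), so with partitioning a query for one role only scans that partition.
A partition column must not be a column that already exists in the table.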

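create external table t_hero_info_1(
    id int comment "ID",
    name string comment "name",
    -- added so that the hp_max filter in the query below has a column to read
    hp_max bigint comment "max HP"
)
comment "hero info"
partitioned by (role string)
row format delimited
fields terminated by "\t";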

Static partitioning

load data local inpath '/root/a.txt'
into table t_hero_info_1 partition(role='sheshou');

-- Partition scan: role is the partition column, so no full table scan is needed
select count(*)
from t_hero_info_1
where role = "sheshou" and hp_max > 6000;
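
To see what the load produced, the partitions can be listed directly. This is a sketch; the partition value 'fashi' is hypothetical.

```sql
-- List the partitions Hive knows about; expect a line such as role=sheshou
show partitions t_hero_info_1;

-- Each partition is a subdirectory of the table directory,
-- for example .../t_hero_info_1/role=sheshou

-- Register an empty partition ahead of time
alter table t_hero_info_1 add if not exists partition (role='fashi');
```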

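10. Multi-level partitioned tables

Two levels of partitioning is the usual case.

-- This reuses the name t_hero_info_1 from section 9; drop that table first or pick another name
create external table t_hero_info_1(
    id int comment "ID",
    name string comment "name"
)
comment "hero info"
-- The order of the partition columns matters: province is the outer level, city the inner one
partitioned by (province string, city string);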

-- Partition 1
load data local inpath '/root/a.txt'
into table t_hero_info_1 partition(province='beijing', city='chaoyang');

-- Partition 2
load data local inpath '/root/b.txt'
into table t_hero_info_1 partition(province='beijing', city='haidian');

-- Partition 3, under a different province
load data local inpath '/root/b.txt'
into table t_hero_info_1 partition(province='shanghai', city='pudong');
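
A sketch of how the two levels show up when inspecting and querying the table, using the partition values loaded above.

```sql
-- List only the partitions under one province
show partitions t_hero_info_1 partition(province='beijing');

-- Filtering on both levels limits the scan to a single directory:
--   .../t_hero_info_1/province=beijing/city=chaoyang
select count(*)
from t_hero_info_1
where province = 'beijing' and city = 'chaoyang';
```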

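11. Dynamic partitioning

Partitions are created dynamically from column values, using insert + select.
Steps: after creating the partitioned table, which has a partition column role, use insert + select to copy the rows of the original table into it.

-- Source table: t_all_hero
-- Partitioned table: t_all_hero_part

-- role is the partition column; role_main is the source column whose value selects the partition
insert into table t_all_hero_part partition(role)
select tmp.*, tmp.role_main
from t_all_hero tmp;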
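
Depending on the Hive version and configuration, a fully dynamic insert like the one above may be rejected in strict mode. Below is a sketch of the session settings commonly applied first; whether they are needed in this environment is an assumption.

```sql
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- nonstrict allows every partition column to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

-- After the insert, check which partitions were created
show partitions t_all_hero_part;
```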

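In production, tables are usually partitioned by date.
Note: a partition column cannot be an existing column, that is, its name must not duplicate a column name.
A partition column is a virtual column: its value is not stored in the underlying data files but comes from the partition directory.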

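12. Bucketed tables

Bucketing is used to optimize queries.
It splits the data of one table into a fixed number of files.

Rules

Each row is hashed on the bucketing column and assigned to a bucket according to the result.
Bucketing is usually done on the primary key.
To load a bucketed table, create an ordinary table and upload the data into it, then fill the bucketed table with insert + select.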

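-- Create the bucketed table (named to match the insert below)
create table test.t_state_info_bucket(
    -- column definitions are omitted in these notes; state must be one of them
)
clustered by (state)
into 5 buckets;
-- state must be an existing column of the table

-- Insert the data
insert into t_state_info_bucket
select *
from t_state_info;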
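
The hashing rule can be illustrated with Hive's built-in hash and pmod functions. This is only an approximation: the hash Hive uses internally for bucketing can differ from the hash UDF depending on the version, so the bucket_id shown here is illustrative rather than guaranteed to match the actual files.

```sql
-- Approximate bucket number for each row: hash the column, then take it modulo 5
select state, pmod(hash(state), 5) as bucket_id
from t_state_info;

-- Read just one of the five buckets
select *
from t_state_info_bucket tablesample(bucket 1 out of 5 on state);
```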

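Benefits

Lookups on the bucketing column do not need to filter the whole table.
Joins on the bucketing column compare fewer row combinations, because matching keys fall into matching buckets.

Window functions
- A query using over returns the same number of rows as its input; rows are not collapsed as they are with group by.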
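
A small sketch of that behavior, assuming the hero_info_1 table from section 5. The column aliases are made up.

```sql
-- Every input row is kept; the window results are attached as extra columns
select
    id,
    name,
    hp_max,
    max(hp_max) over () as overall_max_hp,
    row_number() over (order by hp_max desc) as hp_rank
from hero_info_1;
```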

Parsing JSON

get_json_object: extracts only one field per call
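
A sketch using a made-up JSON literal. The second query shows json_tuple, a built-in alternative that pulls several top-level fields in one call; it is included here only for contrast.

```sql
-- One call per field with get_json_object
select
    get_json_object('{"name":"a","hp":100}', '$.name') as name,
    get_json_object('{"name":"a","hp":100}', '$.hp') as hp;

-- json_tuple extracts several top-level fields at once
select t.name, t.hp
from (select '{"name":"a","hp":100}' as js) src
lateral view json_tuple(src.js, 'name', 'hp') t as name, hp;
```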

posted @ 2025-07-21 21:52 wzzkaifa
