Partitioned Tables and Bucketed Tables
1. Partitioned Tables
| | Static Partitioning | Dynamic Partitioning |
| --- | --- | --- |
| Partition creation | Every partition must be created manually before data is inserted into it | Partitions are created automatically from the values in the input data |
| Typical use case | All partitions are known up front; suited to tables whose partitions are defined early and are few in number | Suited to tables with many partitions whose new partition values cannot be predicted in advance |
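As a quick illustration of the difference summarized above, here is a minimal sketch of the two insert styles side by side; the table and column names (sales_static, sales_dynamic, staging_sales, order_id, amount, dt) are hypothetical, and sections 1.1 and 1.2 walk through each style in detail.

-- Static partitioning: the partition value is written out explicitly in the statement
insert into table sales_static partition(dt="2020-07-02")
select order_id, amount from staging_sales where dt="2020-07-02";

-- Dynamic partitioning: the partition value is taken from the last column of the select
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table sales_dynamic partition(dt)
select order_id, amount, dt from staging_sales;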
1.1 Static Partitioning
Output: n directories (one per partition)
Partition column: a new column, not one of the table's existing data columns
A partitioned table lets queries skip irrelevant partitions instead of doing a full table scan; the partition column effectively acts as an index.
(1) Motivation: one directory of data per day
/user/hive/warehouse/log_partition/20200702
/user/hive/warehouse/log_partition/20200703
/user/hive/warehouse/log_partition/20200704
(2) Create a partitioned table
create external table dept_partition(
    username string,
    fullname string)
partitioned by (month string)
row format delimited fields terminated by ','
lines terminated by '\n';
(3) Load data into each partition
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200702");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200703");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200704");
(4) Resulting layout: one subdirectory per partition under the table directory
/user/hive/warehouse/dept_partition/month=20200702/dept.txt
/user/hive/warehouse/dept_partition/month=20200703/dept.txt
/user/hive/warehouse/dept_partition/month=20200704/dept.txt
(5) Query a single partition
select * from dept_partition where month="20200702";
(6) Add partitions (separate multiple partitions with spaces)
alter table dept_partition add partition(month="20200705");
alter table dept_partition add partition(month="20200706") partition(month="20200707");
(7) Drop partitions (separate multiple partitions with commas)
alter table dept_partition drop partition(month="20200705");
alter table dept_partition drop partition(month="20200706"), partition(month="20200707");
(8) List the table's partitions
show partitions dept_partition;
(9) Show the partitioned table's structure
desc formatted dept_partition;
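To see the partition pruning mentioned above, you can compare the inputs Hive reports for a filtered and an unfiltered query (a sketch, assuming the table from step (2) has been populated); EXPLAIN DEPENDENCY lists the partitions a query would actually read.

-- Only the month=20200702 partition should appear among the reported input partitions
explain dependency select * from dept_partition where month="20200702";
-- Without a filter on the partition column, all three partitions are listed
explain dependency select * from dept_partition;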
1.2 Dynamic Partitioning
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- Use non-strict mode (the default is strict mode, which requires at least one static partition column)
set hive.exec.dynamic.partition.mode=nonstrict;
-- Limit the total number of dynamic partitions a single statement may create to 1000
set hive.exec.max.dynamic.partitions=1000;
-- Limit the number of dynamic partitions a single mapper or reducer task may create to 100
set hive.exec.max.dynamic.partitions.pernode=100;
-- Limit the total number of files a single statement may create to 10000
set hive.exec.max.created.files=10000;
-- desc dept (a non-partitioned table): deptno, dname, loc
-- Create a table to be loaded with dynamic partitions
create table dept_partition_dynamic(dname string, loc int)
partitioned by (deptno int)
row format delimited fields terminated by "\t";
-- Insert data; the dynamic partition column (deptno) must come last in the select list and is matched by position
insert into table dept_partition_dynamic partition(deptno)
select dname, loc, deptno from dept;
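A quick way to confirm that the insert created one partition per distinct deptno value (a sketch, assuming the dept table above exists; deptno=10 is just a placeholder value):

-- Each distinct deptno in dept should now appear as a partition
show partitions dept_partition_dynamic;
-- The partition column can be filtered on like an ordinary column
select count(*) from dept_partition_dynamic where deptno=10;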
2. Bucketed Tables
Output: n files (one per bucket)
Bucketing column: an existing column of the table
(1) Create a bucketed table
create external table student_bucket(
    username string,
    fullname string)
clustered by (username) into 4 buckets
row format delimited fields terminated by ','
lines terminated by '\n';
(2) Populate the table (this cannot be done with load alone; go through a staging table)
(1) Create an ordinary (non-bucketed) staging table
create external table student(
    username string,
    fullname string)
row format delimited fields terminated by ','
lines terminated by '\n';
(2) Load data into the student table
load data local inpath "/opt/module/datas/student.txt" into table student;
(3) Set the relevant properties
-- Enable bucketing enforcement
set hive.enforce.bucketing=true;
-- Let the number of reducers be derived from the number of buckets
set mapreduce.job.reduces=-1;
(4) Insert data into student_bucket from the staging table
insert into table student_bucket select * from student;
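Rows are assigned to buckets by hashing the clustering column modulo the number of buckets. The sketch below approximates that mapping with the hash() and pmod() built-ins; it illustrates the idea rather than Hive's exact internal code path, and the dfs path assumes the default warehouse location.

-- Approximate bucket id each username would land in (4 buckets)
select username, pmod(hash(username), 4) as bucket_id
from student
limit 10;
-- After the insert, the table directory should contain 4 bucket files
dfs -ls /user/hive/warehouse/student_bucket;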
(3) Bucket sampling
select * from student_bucket tablesample(bucket 1 out of 2 on username);
-- Start at bucket 1 and take 4/2 = 2 buckets' worth of data: buckets 1 and 3
select * from student_bucket tablesample(bucket 1 out of 4 on username);
-- Start at bucket 1 and take 4/4 = 1 bucket's worth of data: bucket 1
select * from student_bucket tablesample(bucket 1 out of 8 on username);
-- Start at bucket 1 and take 4/8 = 0.5 of a bucket: half of bucket 1's data
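In general, tablesample(bucket x out of y on col) starts at bucket x and then takes every y-th bucket, so with 4 buckets it returns 4/y buckets' worth of data. One more sketch, to show that x is the starting bucket:

select * from student_bucket tablesample(bucket 2 out of 2 on username);
-- Start at bucket 2 and take 4/2 = 2 buckets' worth of data: buckets 2 and 4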