分区表和分桶表

1. 分区表

 静态分区(Static Partitioning)动态分区(Dynamic Partitioning)
分区创建 数据插入分区之前,需要手动创建每个分区 根据表的输入数据动态创建分区
适用场景 需要提前知道所有分区。适用于分区定义得早且数量少的用例 有很多分区,无法提前预估新分区,动态分区是合适的

1.1 静态分区

输出:n个目录

分区字段:新字段

分区表避免全表扫描,相当于索引

1)引入分区表
/user/hive/warehouse/log_partition/20200702
/user/hive/warehouse/log_partition/20200703
/user/hive/warehouse/log_partition/202007042)创建分区表
create external table dept_partition(
username string,
fullname string)
partitioned by(month string)
row format delimited fields terminated by ','   
lines terminated by '\n';

(3)加载数据导入分区
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200702");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200703");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200704");

(4)结果
/user/hive/warehouse/log_partition/20200702/20200702.log
/user/hive/warehouse/log_partition/20200703/20200703.log
/user/hive/warehouse/log_partition/20200704/20200704.log5)查询
select * from dept_partition where month="20200702";


(6)增加分区(多个分区用空格)
alter table dept_partition add partition(month="20200702")

alter table dept_partition add partition(month="20200702") partition(month="20200703")

(7)删除分区(多个分区用逗号)
alter table dept_partition drop partition(month="20200702")
alter table dept_partition drop partition(month="20200702"),partition(month="20200703")

(8)查看分区
show partitions dept_partition;

(9)查看分区表结构
desc formatted dept_partition;

 1.2 动态分区

--开启动态分区
 set hive.exec.dynamic.partition=true;
 
 --设置为非严格模式(默认为严格模式)
 set hive.exec.dynamic.partition.mode=nonstrict;
 
--参数限制单次整体任务创建分区的数量上限为1000个;
hive.exec.max.dynamic.partitions = 1000;

--参数限制MapReduce任务单个任务(mapper或reducer任务)创建的分区数量为100;
hive.exec.max.dynamic.partitions.pernode = 100;

--参数限制所有单次整体map和reduce任务创建的最大文件数量上限为10000个;
hive.exec.max.created.files = 10000
--desc dept(无分区表)
--deptno,dname,loc

--创建动态分区表
create table dept_partiotion(dname string,loc int)
partitioned by (deptno int)
row format delimited fields terminated by "\t";

--插入数据,字段按位置对应
insert into table dept_partiotion partition(septno)
select dname,loc,deptno from dept;

 

2. 分桶表

输出:n个文件

分桶字段:表中已有字段

(1) 创建分桶表

create external table student_bucket(
username string,
fullname string)
clustered by(username) into 4 buckets
row format delimited fields terminated by ','   
lines terminated by '\n';

(2)导入数据( load 不能实现)

(1)创建非分桶表
create external table student(
username string,
fullname string)
row format delimited fields terminated by ','   
lines terminated by '\n';

(2)student表插入数据
load data local inpath "/opt/module/datas/student.txt" into table student;

(4)设置属性
1.开启分桶功能
set hive.enforce.bucketing=true
2.自己根据桶的个数决定reduce个数
set mapreduce.job.reduces=-1

(5)student_bucket表插入数据
insert into table student_bucket
select * from student;

 (3)分桶抽样

select * from student_bucket tablesample(bucket 1 out of 2 on id) 从桶1开始选取,选取2个桶的数据,【桶1,桶3】4/2=2
select * from student_bucket tablesample(bucket 1 out of 4 on id) 从桶1开始选取,选取1个桶的数据,【桶1】4/4=1
select * from student_bucket tablesample(bucket 1 out of 8 on id) 从桶1开始选取,选取50%的数据,【桶1】4/8=0.5

 

posted on 2020-12-05 13:50  happygril3  阅读(494)  评论(0)    收藏  举报

导航