Partitioned Tables and Bucketed Tables
1. Partitioned Tables
| | Static Partitioning | Dynamic Partitioning |
| --- | --- | --- |
| Partition creation | Every partition must be created manually before data is inserted into it | Partitions are created automatically from the values in the input data |
| Typical use case | All partitions are known up front; suited to tables whose partitions are defined early and are few in number | Suited to tables with many partitions whose new partition values cannot be predicted in advance |
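As a quick illustration of the difference summarized above, here is a minimal sketch of the two insert styles side by side; the table and column names (sales_static, sales_dynamic, staging_sales, order_id, amount, dt) are hypothetical, and sections 1.1 and 1.2 walk through each style in detail.

-- Static partitioning: the partition value is written out explicitly in the statement
insert into table sales_static partition(dt="2020-07-02")
select order_id, amount from staging_sales where dt="2020-07-02";

-- Dynamic partitioning: the partition value is taken from the last column of the select
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table sales_dynamic partition(dt)
select order_id, amount, dt from staging_sales;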
1.1 Static Partitioning
Output: n directories (one per partition)
Partition column: a new column, not one of the table's existing data columns
A partitioned table lets queries skip irrelevant partitions instead of doing a full table scan; the partition column effectively acts as an index.
(1) Motivation: one directory of data per day
/user/hive/warehouse/log_partition/20200702
/user/hive/warehouse/log_partition/20200703
/user/hive/warehouse/log_partition/20200704
(2) Create a partitioned table
create external table dept_partition(
    username string,
    fullname string)
partitioned by (month string)
row format delimited fields terminated by ','
lines terminated by '\n';
(3) Load data into each partition
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200702");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200703");
load data local inpath "/opt/module/datas/dept.txt" into table dept_partition partition(month="20200704");
(4) Resulting layout: one subdirectory per partition under the table directory
/user/hive/warehouse/dept_partition/month=20200702/dept.txt
/user/hive/warehouse/dept_partition/month=20200703/dept.txt
/user/hive/warehouse/dept_partition/month=20200704/dept.txt
(5) Query a single partition
select * from dept_partition where month="20200702";
(6) Add partitions (separate multiple partitions with spaces)
alter table dept_partition add partition(month="20200705");
alter table dept_partition add partition(month="20200706") partition(month="20200707");
(7) Drop partitions (separate multiple partitions with commas)
alter table dept_partition drop partition(month="20200705");
alter table dept_partition drop partition(month="20200706"), partition(month="20200707");
(8) List the table's partitions
show partitions dept_partition;
(9) Show the partitioned table's structure
desc formatted dept_partition;
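To see the partition pruning mentioned above, you can compare the inputs Hive reports for a filtered and an unfiltered query (a sketch, assuming the table from step (2) has been populated); EXPLAIN DEPENDENCY lists the partitions a query would actually read.

-- Only the month=20200702 partition should appear among the reported input partitions
explain dependency select * from dept_partition where month="20200702";
-- Without a filter on the partition column, all three partitions are listed
explain dependency select * from dept_partition;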
1.2 Dynamic Partitioning
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- Use non-strict mode (the default is strict mode, which requires at least one static partition column)
set hive.exec.dynamic.partition.mode=nonstrict;
-- Limit the total number of dynamic partitions a single statement may create to 1000
set hive.exec.max.dynamic.partitions=1000;
-- Limit the number of dynamic partitions a single mapper or reducer task may create to 100
set hive.exec.max.dynamic.partitions.pernode=100;
-- Limit the total number of files a single statement may create to 10000
set hive.exec.max.created.files=10000;
-- desc dept (a non-partitioned table): deptno, dname, loc
-- Create a table to be loaded with dynamic partitions
create table dept_partition_dynamic(dname string, loc int)
partitioned by (deptno int)
row format delimited fields terminated by "\t";
-- Insert data; the dynamic partition column (deptno) must come last in the select list and is matched by position
insert into table dept_partition_dynamic partition(deptno)
select dname, loc, deptno from dept;
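A quick way to confirm that the insert created one partition per distinct deptno value (a sketch, assuming the dept table above exists; deptno=10 is just a placeholder value):

-- Each distinct deptno in dept should now appear as a partition
show partitions dept_partition_dynamic;
-- The partition column can be filtered on like an ordinary column
select count(*) from dept_partition_dynamic where deptno=10;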
2. Bucketed Tables
Output: n files (one per bucket)
Bucketing column: an existing column of the table
(1) Create a bucketed table
create external table student_bucket(
    username string,
    fullname string)
clustered by (username) into 4 buckets
row format delimited fields terminated by ','
lines terminated by '\n';
(2) Populate the table (this cannot be done with load alone; go through a staging table)
(1) Create an ordinary (non-bucketed) staging table
create external table student(
    username string,
    fullname string)
row format delimited fields terminated by ','
lines terminated by '\n';
(2) Load data into the student table
load data local inpath "/opt/module/datas/student.txt" into table student;
(3) Set the relevant properties
-- Enable bucketing enforcement
set hive.enforce.bucketing=true;
-- Let the number of reducers be derived from the number of buckets
set mapreduce.job.reduces=-1;
(4) Insert data into student_bucket from the staging table
insert into table student_bucket select * from student;
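Rows are assigned to buckets by hashing the clustering column modulo the number of buckets. The sketch below approximates that mapping with the hash() and pmod() built-ins; it illustrates the idea rather than Hive's exact internal code path, and the dfs path assumes the default warehouse location.

-- Approximate bucket id each username would land in (4 buckets)
select username, pmod(hash(username), 4) as bucket_id
from student
limit 10;
-- After the insert, the table directory should contain 4 bucket files
dfs -ls /user/hive/warehouse/student_bucket;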
(3) Bucket sampling
select * from student_bucket tablesample(bucket 1 out of 2 on username);
-- Start at bucket 1 and take 4/2 = 2 buckets' worth of data: buckets 1 and 3
select * from student_bucket tablesample(bucket 1 out of 4 on username);
-- Start at bucket 1 and take 4/4 = 1 bucket's worth of data: bucket 1
select * from student_bucket tablesample(bucket 1 out of 8 on username);
-- Start at bucket 1 and take 4/8 = 0.5 of a bucket: half of bucket 1's data
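In general, tablesample(bucket x out of y on col) starts at bucket x and then takes every y-th bucket, so with 4 buckets it returns 4/y buckets' worth of data. One more sketch, to show that x is the starting bucket:

select * from student_bucket tablesample(bucket 2 out of 2 on username);
-- Start at bucket 2 and take 4/2 = 2 buckets' worth of data: buckets 2 and 4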