
[HQL] DQL (Part 3): Bucketing and Sampling Queries

Posted on 2021-08-24 16:23 by MissRong

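1. Storing data in a bucketed table

 ----First, create the bucketed table:
    create table stu_buck1(id int, name string)
    clustered by (id)
    into 4 buckets
    row format delimited fields terminated by '\t';
 ----Load data into the bucketed table:
    --load data local inpath '/opt/module/hiveTest/student.txt' into table stu_buck1;
 Note: load does not bucket the data; it only places the file as-is in the table directory. The data has to be written with an insert ... select from another table instead.
 ----Create an ordinary table stu first:
    create table stu(id int, name string)
    row format delimited fields terminated by '\t';
 ----Then load the data into stu:
    load data local inpath '/opt/module/hiveTest/student.txt' into table stu;
 ----Clear the existing data in the bucketed table stu_buck1:
    truncate table stu_buck1;
 ----Populate the bucketed table from a query on stu:
    insert into table stu_buck1
    select id, name from stu;
 Note: if the stu_buck1 directory on HDFS still shows only one bucket file at this point, bucketing was not enforced. Set the parameter below, then truncate the table and run the insert again:
    set hive.enforce.bucketing = true;
 (This property belongs to Hive 1.x. Hive 2.0 and later removed it and always enforce bucketing.)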
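
To make the effect of `clustered by (id) into 4 buckets` concrete, here is a minimal Python sketch of how rows end up in bucket files. It assumes Hive's default behavior for an int column, where the hash of the value is the integer itself and the bucket index is `(hash & Integer.MAX_VALUE) % number_of_buckets`. The ids 1001 to 1016 are made up for illustration and are not taken from student.txt.

```python
from collections import defaultdict

NUM_BUCKETS = 4          # matches "into 4 buckets" in the DDL
INT_MAX = 0x7FFFFFFF     # Java Integer.MAX_VALUE, used to clear the sign bit


def bucket_index(id_value: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based bucket for an int key.

    Assumption: the hash of an int column value is the value itself.
    """
    return (id_value & INT_MAX) % num_buckets


def assign_buckets(ids, num_buckets: int = NUM_BUCKETS):
    """Group ids by bucket index, mimicking the bucket files an insert writes."""
    buckets = defaultdict(list)
    for id_value in ids:
        buckets[bucket_index(id_value, num_buckets)].append(id_value)
    return dict(buckets)


if __name__ == "__main__":
    hypothetical_ids = range(1001, 1017)  # illustrative ids only
    for index, members in sorted(assign_buckets(hypothetical_ids).items()):
        print(f"bucket file {index}: {members}")
```

Under this model every id with the same remainder modulo 4 lands in the same file, which is what later lets tablesample read only some of the files.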

2. Bucket sampling queries -- [tablesample()]

        select * from stu_buck1 tablesample(bucket 1 out of 4 on id);

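Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).
y should be a multiple or a factor of the table's total bucket count. Hive uses y to determine the sampling ratio (roughly 1/y of the data).
Other values of y are accepted but not recommended. x is the bucket to start from, counting from 1.

    --For example, with 4 buckets in total, tablesample(bucket 4 out of 4)
            reads (4/4=) 1 bucket of data, namely the 4th bucket.
            With tablesample(bucket 1 out of 8), y is twice the bucket count, so (4/8=) half of the 1st bucket is read:
            hive (default)> select * from stu_buck1 tablesample(bucket 1 out of 8 on id);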
            OK
            stu_buck1.id stu_buck1.name
            1040 黄忠
            1032 妲己
            1024 干将
            1016 嬴政
            1008 诸葛亮
            Time taken: 0.388 seconds, Fetched: 5 row(s)
    Note: x must be less than or equal to y; otherwise the query fails with:
          FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
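
The arithmetic behind BUCKET x OUT OF y can be sketched in Python. This is only a model: it assumes the hash of an int id is the id itself and that a row is kept when `hash % y` equals `x - 1`. The id range 1001 to 1040 is an assumption chosen for illustration, not the actual contents of student.txt.

```python
INT_MAX = 0x7FFFFFFF  # Java Integer.MAX_VALUE


def table_sample(ids, x: int, y: int):
    """Simulate tablesample(bucket x out of y on id) for int ids."""
    if x < 1:
        raise ValueError("x is 1-based and must be at least 1")
    if x > y:
        # Same condition that triggers SemanticException 10061 in Hive.
        raise ValueError("Numerator should not be bigger than denominator")
    return [i for i in ids if (i & INT_MAX) % y == x - 1]


if __name__ == "__main__":
    hypothetical_ids = range(1001, 1041)  # assumed ids, for illustration only
    print(table_sample(hypothetical_ids, 1, 8))  # half of the 1st of 4 buckets
    print(table_sample(hypothetical_ids, 4, 4))  # the whole 4th bucket
    print(table_sample(hypothetical_ids, 1, 2))  # 1st and 3rd of the 4 buckets
```

With these assumed ids, bucket 1 out of 8 keeps exactly the multiples of 8, the same pattern as the ids in the query output above (the sketch lists them in ascending order, whereas Hive's output order differs).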

3. Block sampling -- [tablesample(X percent)]

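Hive also offers sampling by percentage. Unlike row-based sampling,
it samples a percentage of the data blocks under the input path, so the percentage refers to data size rather than row count.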

        hive (default)> select * from stu tablesample(0.1 percent);

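This sampling method does not necessarily work with every file format.
In addition, the smallest sampling unit is one HDFS block.
So if the table's data is smaller than a typical block size of 128 MB, all rows are returned.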

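Block sampling also accepts a total data size or a row count instead of a percentage (the row count applies to each input split):

        SELECT * FROM source TABLESAMPLE(100M) s;
        SELECT * FROM source TABLESAMPLE(10 ROWS);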
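
The block granularity described in this section can be pictured with a simplified Python model. It is not Hive's implementation: it only assumes that whole HDFS blocks are read until at least the requested percentage of the data size is covered. The 128 MB block size and the table sizes are illustrative values.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size


def sampled_size_mb(table_size_mb: float, percent: float,
                    block_size_mb: float = BLOCK_SIZE_MB) -> float:
    """Estimate how much data tablesample(percent) reads in this model."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    if table_size_mb <= block_size_mb:
        # A table no larger than one block cannot be subdivided: all of it is read.
        return table_size_mb
    requested_mb = table_size_mb * percent / 100
    blocks_needed = max(1, math.ceil(requested_mb / block_size_mb))
    return min(table_size_mb, blocks_needed * block_size_mb)


if __name__ == "__main__":
    for size_mb, pct in [(50, 0.1), (1024, 0.1), (1024, 30)]:
        print(f"{size_mb} MB table, {pct} percent -> {sampled_size_mb(size_mb, pct)} MB read")
```

In this model a 50 MB table is read in full even at 0.1 percent, which is why a small test table such as stu returns every row.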
