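HBase Filters

HBase filters
Integrating Hive with HBase

HBase filters

Filters narrow a read by further conditions such as column family, column qualifier, and version.
HBase data is ordered in three dimensions: by row key, by column, and by version.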

Filter parameters

A filter needs at least two kinds of parameters: an abstract comparison operator and a comparator.

Code and resources:

https://gitee.com/yu-min-guo/hbase-test

Operators

LESS <
LESS_OR_EQUAL <=
EQUAL =
NOT_EQUAL <>
GREATER_OR_EQUAL >=
GREATER >
NO_OP excludes everything
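To make the operator list concrete, here is a minimal sketch of how an operator can be applied to the result of a byte-wise comparison. It is a plain-Java model, not HBase's own code: the class `OperatorSketch`, the enum `Op`, and the method `matches` are hypothetical names, and the comparison direction (cell value on the left) is an assumption chosen for readability.

```java
import java.util.Arrays;

/** Hypothetical model of applying a comparison operator to a comparator result. */
final class OperatorSketch {
    enum Op { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER, NO_OP }

    /** Returns true when "cellValue OP reference" holds under unsigned byte-wise ordering. */
    static boolean matches(Op op, byte[] cellValue, byte[] reference) {
        // Negative, zero, or positive, like Comparable.compareTo
        int cmp = Arrays.compareUnsigned(cellValue, reference);
        switch (op) {
            case LESS:             return cmp < 0;
            case LESS_OR_EQUAL:    return cmp <= 0;
            case EQUAL:            return cmp == 0;
            case NOT_EQUAL:        return cmp != 0;
            case GREATER_OR_EQUAL: return cmp >= 0;
            case GREATER:          return cmp > 0;
            default:               return false; // NO_OP: nothing matches
        }
    }
}
```

With this model, `matches(Op.LESS, "abc", "abd")` holds because the two arrays first differ at the third byte, where `c` sorts before `d`.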

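The six comparators

BinaryComparator

Compares the given byte array byte by byte in lexicographic order, using Bytes.compareTo(byte[]).

BinaryPrefixComparator

Same as BinaryComparator, but only checks whether the leading (left-most) bytes match the given prefix.

NullComparator

Checks whether the given value is null.

BitComparator

Compares bit by bit (bitwise AND, OR, or XOR).

RegexStringComparator

Matches the value against a regular expression; it supports only the EQUAL and NOT_EQUAL operators.

SubstringComparator

Checks whether the given substring occurs in the value.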
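The differences between the comparators are easier to see side by side. The sketch below models four of them with the JDK only; `ComparatorSketch` and its method names are hypothetical, and the case-insensitive substring match and the "find anywhere" regex match reflect my understanding of the HBase comparators rather than a copy of their code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical JDK-only model of four HBase comparators. */
final class ComparatorSketch {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** BinaryComparator-like: unsigned lexicographic comparison of the whole value. */
    static int binary(byte[] value, byte[] reference) {
        return Arrays.compareUnsigned(value, reference);
    }

    /** BinaryPrefixComparator-like: compares only the first reference.length bytes of value. */
    static int binaryPrefix(byte[] value, byte[] reference) {
        int n = Math.min(value.length, reference.length);
        return Arrays.compareUnsigned(value, 0, n, reference, 0, reference.length);
    }

    /** SubstringComparator-like: case-insensitive "contains" (assumed behaviour). */
    static boolean substring(String value, String needle) {
        return value.toLowerCase().contains(needle.toLowerCase());
    }

    /** RegexStringComparator-like: true if the pattern is found anywhere in the value. */
    static boolean regex(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).find();
    }
}
```

For a value such as "文科一班", `binary` against "文科" reports a difference because the value is longer, while `binaryPrefix` reports equality because the leading bytes agree.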

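Dedicated filters

Single-column value filter: SingleColumnValueFilter

SingleColumnValueFilter returns every cell of each row that contains a cell satisfying the condition; in other words, it returns the whole row.

The example below uses SingleColumnValueFilter to query the students whose age value compares greater than "23".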

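import java.io.IOException;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.junit.Test;

    /**
     * SingleColumnValueFilter
     * Tests a single, specified column.
     * If the column satisfies the condition, the whole row that holds the cell is returned.
     * If the column does not satisfy the condition, the whole row is filtered out.
     * If a row does not contain the column at all, the row is returned by default.
     */
    @Test
    public void SingleColumnValueFilterTest() throws IOException {
        SingleColumnValueFilter scvf = new SingleColumnValueFilter(
                "cf1".getBytes(),  // column family; must exist in the table (other examples in this post use "info")
                "age".getBytes(),  // column qualifier
                CompareFilter.CompareOp.GREATER,
                "23".getBytes()    // compared byte by byte, not as a number
        );

        // getFilterResult is a helper defined elsewhere in the test class (not shown in this post)
        getFilterResult(scvf);
    }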
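Two details of the filter above are easy to miss: a row that lacks the tested column is kept by default, and the value "23" is compared as bytes rather than as a number. The sketch below models both with a plain `Map` standing in for a row. `SingleColumnValueSketch`, `keepRow`, and the `filterIfMissing` flag are illustrative names of mine; the flag is meant to mirror the `setFilterIfMissing` option of the real filter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

/** Hypothetical row-level model of a "column > reference" single-column value filter. */
final class SingleColumnValueSketch {
    /**
     * Keeps the row when the column value is byte-wise greater than the reference.
     * A row without the column is kept unless filterIfMissing is true.
     */
    static boolean keepRow(Map<String, String> row, String column,
                           String reference, boolean filterIfMissing) {
        String value = row.get(column);
        if (value == null) {
            return !filterIfMissing;
        }
        // Unsigned lexicographic comparison of the UTF-8 bytes, not a numeric comparison
        return Arrays.compareUnsigned(
                value.getBytes(StandardCharsets.UTF_8),
                reference.getBytes(StandardCharsets.UTF_8)) > 0;
    }
}
```

Under this ordering "9" is greater than "23" and "100" is smaller, because only the first differing byte decides. If ages are stored as strings, a fixed width (for example zero-padded) keeps byte order and numeric order aligned.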

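Column value exclude filter: SingleColumnValueExcludeFilter

Works like SingleColumnValueFilter, except that the tested column is left out of the result; all other columns of a matching row are returned.

The example below uses SingleColumnValueExcludeFilter with a BinaryPrefixComparator to query all students whose class name starts with "文科" (liberal arts); the clazz column is not returned.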

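import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.SingleColumnValueExcludeFilter;

    /**
     * Select the students in liberal-arts classes.
     * SingleColumnValueExcludeFilter: single-column value exclude filter.
     * The main difference from SingleColumnValueFilter is that the result
     * does not include the column being compared.
     */
    @Test
    public void SingleColumnValueExcludeFilterTest() throws IOException {

        // "文科" = liberal arts; matches every clazz value that starts with these bytes
        BinaryPrefixComparator bpc = new BinaryPrefixComparator("文科".getBytes());
        SingleColumnValueExcludeFilter scvef = new SingleColumnValueExcludeFilter(
                "cf1".getBytes(),
                "clazz".getBytes(),
                CompareFilter.CompareOp.EQUAL,
                bpc
        );
        getFilterResult(scvef);
    }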
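The effect on the returned row can be modelled in a few lines. This is a sketch under assumptions, not HBase code: a row is a plain `Map`, the prefix test uses `String.startsWith` in place of a byte-prefix comparison, and `ExcludeSketch.filter` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of a single-column value exclude filter with a prefix test. */
final class ExcludeSketch {
    /**
     * Returns the row without the tested column when its value starts with the prefix,
     * or an empty Optional when the row is filtered out.
     * A row that lacks the column is kept, matching the default described above.
     */
    static Optional<Map<String, String>> filter(Map<String, String> row,
                                                String column, String prefix) {
        String value = row.get(column);
        if (value != null && !value.startsWith(prefix)) {
            return Optional.empty();
        }
        Map<String, String> result = new LinkedHashMap<>(row);
        result.remove(column); // the tested column never appears in the output
        return Optional.of(result);
    }
}
```

A row with clazz "文科一班" comes back with its name, age, and gender but no clazz entry; a row with clazz "理科一班" is dropped entirely.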

Rowkey prefix filter: PrefixFilter

The example below uses PrefixFilter to query all rows whose rowkey starts with 150010088.

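import org.apache.hadoop.hbase.filter.PrefixFilter;

    /**
     * Select the students whose rowkey starts with 150010088.
     * PrefixFilter: rowkey prefix filter, the most efficient filter here
     * because rowkeys are stored in sorted order.
     * Equivalent to combining a comparator with a filter as described above
     * (for example a RowFilter with a BinaryPrefixComparator).
     */
    @Test
    public void PrefixFilterTest() throws IOException {
        PrefixFilter prefixFilter = new PrefixFilter("150010088".getBytes());

        getFilterResult(prefixFilter);
    }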
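Why a prefix lookup is cheap on sorted rowkeys can be shown without HBase at all. In the sketch below a `TreeSet` stands in for the sorted rowkeys of a table; `PrefixScanSketch.scanPrefix` is a hypothetical helper, and `String` ordering is assumed to agree with byte ordering, which holds for the ASCII digits used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical model of a prefix scan over sorted row keys. */
final class PrefixScanSketch {
    static List<String> scanPrefix(NavigableSet<String> sortedRowKeys, String prefix) {
        List<String> hits = new ArrayList<>();
        // Seek to the first key >= prefix instead of starting at the first row
        for (String key : sortedRowKeys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so no later key can carry the prefix
            }
            hits.add(key);
        }
        return hits;
    }
}
```

All keys that share a prefix sit next to each other, so the scan touches only those keys plus one more. With the HBase client, a comparable effect should be obtainable by also giving the Scan a start row equal to the prefix (for example via `withStartRow`), so that it does not begin at the first row of the table.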

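Paging filter: PageFilter

The example below uses PageFilter to fetch the fourth page of data, 10 rows per page.

Paging with PageFilter is inefficient: every request has to scan the rows in front of the requested page before it reaches the rows it actually needs.

A well-designed rowkey can serve the paging requirement instead.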

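import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

    /**
     * Paging filter: PageFilter
     * Fetch the fourth page, 10 rows per page.
     *
     * All rows before the requested page are scanned as well, so performance is very poor.
     */
    @Test
    public void PageFilter1() throws IOException {
        int page = 4;
        int pageSize = 10;

        // First find the rowkey of the first row on page 4:
        // read (page - 1) * pageSize + 1 rows and keep the last rowkey seen.
        int pageFirst = (page - 1) * pageSize + 1;
        Scan scan = new Scan();
        scan.setFilter(new PageFilter(pageFirst));
        // scan.setLimit(pageFirst); // PageFilter works much like setLimit

        // student is the Table field of the test class (not shown in this post)
        byte[] startRow = null;
        int seen = 0;
        try (ResultScanner scanner = student.getScanner(scan)) {
            for (Result rs : scanner) {
                startRow = rs.getRow();
                // PageFilter is applied per region, so also stop on the client side
                if (++seen == pageFirst) {
                    break;
                }
            }
        }
        if (seen < pageFirst) {
            return; // fewer rows than needed: the requested page does not exist
        }

        Scan scan1 = new Scan();
        scan1.withStartRow(startRow);
        scan1.setFilter(new PageFilter(pageSize));
        int printed = 0;
        try (ResultScanner scanner2 = student.getScanner(scan1)) {
            for (Result rs : scanner2) {
                String id = Bytes.toString(rs.getRow());
                String name = Bytes.toString(rs.getValue("info".getBytes(), "name".getBytes()));
                String age = Bytes.toString(rs.getValue("info".getBytes(), "age".getBytes()));
                String gender = Bytes.toString(rs.getValue("info".getBytes(), "gender".getBytes()));
                String clazz = Bytes.toString(rs.getValue("info".getBytes(), "clazz".getBytes()));
                System.out.println(id + "," + name + "," + age + "," + gender + "," + clazz);
                if (++printed == pageSize) {
                    break;
                }
            }
        }
    }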
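The rowkey-based alternative mentioned earlier can be sketched as "remember the last rowkey of the previous page and continue right after it". The model below again uses a `TreeSet` in place of the table; `KeyPagingSketch.nextPage` is a hypothetical helper, not part of any HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical model of paging by "last rowkey seen" instead of by page number. */
final class KeyPagingSketch {
    /**
     * Returns up to pageSize keys that sort strictly after lastSeen.
     * Pass null as lastSeen to fetch the first page.
     */
    static List<String> nextPage(NavigableSet<String> sortedRowKeys, String lastSeen, int pageSize) {
        NavigableSet<String> rest = (lastSeen == null)
                ? sortedRowKeys
                : sortedRowKeys.tailSet(lastSeen, false); // exclusive start
        List<String> page = new ArrayList<>(pageSize);
        for (String key : rest) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }
}
```

Each call reads only one page of keys, no matter how deep the page is. The trade-off is that pages must be visited in order, since page N needs the last key of page N-1. In the HBase client the same idea would map to a Scan that starts just after the last rowkey, assuming a client version that offers an exclusive start row, combined with PageFilter or a row limit.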

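Bloom Filter

A Bloom filter consists of a bit array (a binary vector) and n hash functions.
It is used to test whether an element is a member of a set. Its strengths are space efficiency and query time, both far better than general-purpose structures; its weaknesses are a certain false-positive rate and the difficulty of deleting elements.
A Bloom filter can say with 100% certainty that an element is not in the set, but it cannot say with 100% certainty that an element is in it.
HBase uses a Bloom filter to decide quickly whether a rowkey can be in a given HFile, which rules out most HFiles and reduces the number of blocks that have to be scanned.
A Bloom filter has two important parameters:
- the number of hash functions
- the size of the bit array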
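A small, self-contained Bloom filter shows how the two parameters interact. This is a textbook sketch, not the implementation HBase uses: the class `BloomSketch`, the FNV-style hashing with double hashing, and the sizing helpers are all my own choices. The sizing formulas are the standard ones, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n keys and a target false-positive rate p.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Textbook Bloom filter sketch; hashing scheme and names are illustrative assumptions. */
final class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Bit array size m for n expected keys and target false-positive rate p. */
    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions k for n expected keys and m bits. */
    static int optimalHashes(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th position is derived from two base hashes
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }
}
```

A key that was added always sets all of its bits, so `mightContain` can never return false for it; a key that was never added may still find all of its bits set by other keys, which is the false-positive case described above.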

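Blocks in an HFile that relate to the Bloom filter:
Scanned Block Section (read when the HFile is scanned): Bloom Block
Load-on-open section (loaded into memory when the RegionServer opens the HFile): BloomFilter Meta Block, Bloom Index Block
Bloom Block: the Bloom data block, which stores the Bloom bit array
Bloom Index Block: the index over the Bloom data blocks
BloomFilter Meta Block: metadata about the Bloom data blocks from the HFile's point of view, such as their size and count.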

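Each HFile in HBase has its own bit array. When a KeyValue is written to the HFile, its key is passed through several hash functions and the corresponding bits are set to 1. When a get request arrives, its key is hashed the same way; if any of the corresponding bits is 0, the requested data is not in that HFile.
The Bloom Block in an HFile stores the bit array described above. When an HFile is large, it has many Data Blocks and many KeyValues, so many rowkeys have to be mapped into the bit array. To keep the false-positive rate low the bit array has to grow, and the Bloom Block grows with it. The Bloom Index Block addresses this problem: an HFile holds several Bloom Blocks (bit arrays), split by rowkey, and each run of consecutive keys shares one bit array. A rowkey lookup therefore first goes through the Bloom Index Block (held in memory) to locate the right Bloom Block, then loads that Bloom Block into memory and tests the key against it.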
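The two-step lookup, index first and bit array second, can be modelled as follows. This is a simplified sketch under assumptions: chunks are cut by key count rather than by block size, rowkeys are unique and already sorted, and the class `ChunkedBloomSketch`, its constants, and its hashing are hypothetical rather than taken from HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of several Bloom chunks located through an in-memory index. */
final class ChunkedBloomSketch {
    private static final int BITS_PER_CHUNK = 1024;
    private static final int NUM_HASHES = 4;

    // "Bloom index": first row key of each chunk -> that chunk's bit array
    private final TreeMap<String, BitSet> index = new TreeMap<>();

    /** Builds the chunks from unique, sorted row keys, keysPerChunk keys per chunk. */
    ChunkedBloomSketch(List<String> sortedKeys, int keysPerChunk) {
        BitSet current = null;
        for (int i = 0; i < sortedKeys.size(); i++) {
            String key = sortedKeys.get(i);
            if (i % keysPerChunk == 0) {
                current = new BitSet(BITS_PER_CHUNK); // start a new chunk
                index.put(key, current);
            }
            for (int h = 0; h < NUM_HASHES; h++) {
                current.set(bit(key, h));
            }
        }
    }

    boolean mightContain(String key) {
        // Step 1: the index finds the chunk whose key range covers this key
        Map.Entry<String, BitSet> chunk = index.floorEntry(key);
        if (chunk == null) {
            return false; // key sorts before the first chunk
        }
        // Step 2: test the key against that chunk's bit array only
        for (int h = 0; h < NUM_HASHES; h++) {
            if (!chunk.getValue().get(bit(key, h))) {
                return false;
            }
        }
        return true;
    }

    // Two FNV-style base hashes combined by double hashing
    private static int bit(String key, int i) {
        int h1 = 0x811C9DC5;
        int h2 = 0x01000193;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = (h1 ^ (b & 0xFF)) * 16777619;
            h2 = (h2 ^ (b & 0xFF)) * 16777619;
        }
        return Math.floorMod(h1 + i * (h2 | 1), BITS_PER_CHUNK);
    }
}
```

Because each bit array covers only a bounded run of keys, its size stays fixed however large the file grows; only the small index has to stay in memory.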

Integrating Hive with HBase

create external table students_hbase
(
id string,
name string,
age string,
gender string, 
clazz string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = "
:key,
info:name,
info:age,
info:gender,
info:clazz
")
tblproperties("hbase.table.name" = "default:student");

posted @ 2021-12-06 22:16 莫尼莫尼
