HBase Study Notes

1. HBase Overview

1.1 What HBase Is

HBase is a distributed, scalable NoSQL database designed for storing massive amounts of data.

1.2 HBase Data Model

Logically, HBase's data model looks much like that of a relational database: data lives in tables with rows and columns. From the perspective of HBase's underlying physical storage (key-value), however, HBase is better thought of as a multi-dimensional map.
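For intuition only, this multi-dimensional map view can be sketched as nested sorted maps keyed by rowkey, column (family:qualifier) and timestamp. The snippet below is merely an illustration of the logical view, not of how HBase actually stores data:

import java.util.TreeMap;

public class MultiDimensionalMapSketch {
    public static void main(String[] args) {
        // rowkey -> (family:qualifier -> (timestamp -> value)), every level kept in sorted order
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();
        table.computeIfAbsent("1001", k -> new TreeMap<>())
             .computeIfAbsent("info:name", k -> new TreeMap<>())
             .put(1585709253833L, "zhangsan");
        table.computeIfAbsent("1001", k -> new TreeMap<>())
             .computeIfAbsent("info:age", k -> new TreeMap<>())
             .put(1585709253833L, "18");
        System.out.println(table); // {1001={info:age={1585709253833=18}, info:name={1585709253833=zhangsan}}}
    }
}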

1.2.1 Data Model

1) Name Space

A namespace is similar to the database concept in a relational database; each namespace contains multiple tables. HBase ships with two built-in namespaces, hbase and default: hbase holds HBase's internal tables, while default is the namespace user tables go into by default.

2) Region

Similar to the table concept in a relational database. The difference is that an HBase table definition only declares column families, not concrete columns, which means that fields can be specified dynamically and on demand when data is written. Compared with a relational database, HBase therefore copes easily with changing fields.

3) Row

Each row of an HBase table consists of one RowKey and multiple Columns. Data is stored in lexicographic order of the RowKey, and reads can only be driven by the RowKey, so RowKey design is extremely important.
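Because the order is lexicographic over the raw bytes, numeric row keys are usually zero-padded so that they also sort numerically. A tiny sketch (assumes only the hbase-client Bytes utility; an illustration, not part of the original notes):

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyOrderSketch {
    public static void main(String[] args) {
        // Bytes.compareTo mirrors the byte-wise comparison HBase applies to row keys.
        // "row10" sorts before "row2" because '1' < '2' in the fourth byte ...
        System.out.println(Bytes.compareTo(Bytes.toBytes("row10"), Bytes.toBytes("row2")) < 0);  // true
        // ... so zero-padding ("row02" vs "row10") restores the expected numeric order.
        System.out.println(Bytes.compareTo(Bytes.toBytes("row02"), Bytes.toBytes("row10")) < 0); // true
    }
}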

4) Column

Every column in HBase is qualified by a Column Family and a Column Qualifier, for example info:name or info:age. When creating a table, only the column families need to be declared; column qualifiers do not have to be defined in advance.

5) Time Stamp

Used to identify different versions of the data. If no timestamp is specified when a cell is written, the system automatically adds one, set to the time the data was written to HBase.
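A hedged sketch of reading several versions with the 1.x Java client introduced in section 3 (the quorum hosts and the student/info names are taken from this document and assumed to exist; the info family must be configured to keep more than one version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop102,hadoop103,hadoop104"); // assumed hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table student = conn.getTable(TableName.valueOf("student"))) {
            Get get = new Get(Bytes.toBytes("1001"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
            get.setMaxVersions(3); // return up to three versions of the cell, newest first
            for (Cell cell : student.get(get).rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}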

6) Cell

The unit uniquely identified by {rowkey, column family:column qualifier, timestamp}. Data in a cell has no type; everything is stored as raw bytes.
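Because cell values are untyped byte arrays, encoding and decoding them is the client's responsibility. A small illustration with the Bytes helper from hbase-client (not part of the original notes):

import org.apache.hadoop.hbase.util.Bytes;

public class CellBytesSketch {
    public static void main(String[] args) {
        byte[] age = Bytes.toBytes(18);                              // the int 18 becomes 4 raw bytes
        System.out.println(Bytes.toInt(age));                        // 18 again; correct only because we decode with the matching type
        System.out.println(Bytes.toString(Bytes.toBytes("male")));   // strings round-trip the same way
    }
}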

 

1.3 HBase Basic Architecture

 

 

Architecture roles:

1) Region Server

The Region Server manages Regions; its implementing class is HRegionServer. Its main responsibilities are:

Data operations: get, put, delete;

Region operations: splitRegion, compactRegion.

2) Master

The Master manages all Region Servers; its implementing class is HMaster. Its main responsibilities are:

Table operations: create, delete, alter;

RegionServer operations: assigning regions to each RegionServer, monitoring the state of every RegionServer, load balancing and failover.

3) Zookeeper

HBase uses Zookeeper for Master high availability, RegionServer monitoring, the entry point to metadata, and maintenance of cluster configuration.

4) HDFS

HDFS provides the underlying data storage service for HBase and also gives HBase high-availability support.

 

 

2. HBase Shell Operations

 

2.1 HBase Shell Operations

2.1.1 Basic Operations

1. Enter the HBase client command line

The HBase shell has quite a few rough edges, for example characters cannot be edited once typed and the console can get stuck in a loop, so it is not recommended for regular use; it is enough to know the basics.

 

First, the environment configuration under the /etc/profile.d/ directory needs to be in place.

[root@hadoop102 hbase]# hbase shell

2. View the help command

hbase(main):001:0> help
HBase Shell, version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu Apr  6 19:36:54 PDT 2017
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

COMMAND GROUPS:
  Group name: general
  Commands: status, table_help, version, whoami

  Group name: ddl
  Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters

  Group name: namespace
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

  Group name: dml
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

  Group name: tools
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, splitormerge_enabled, splitormerge_switch, trace, unassign, wal_roll, zk_dump

  Group name: replication
  Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs

  Group name: snapshots
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot

  Group name: configuration
  Commands: update_all_config, update_config

  Group name: quotas
  Commands: list_quotas, set_quota

  Group name: security
  Commands: grant, list_security_capabilities, revoke, user_permission

  Group name: procedures
  Commands: abort_procedure, list_procedures

  Group name: visibility labels
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility

SHELL USAGE:
Quote all names in HBase Shell such as table and column names.  Commas delimit
command parameters.  Type <RETURN> after entering a command to run it.
Dictionaries of configuration used in the creation and alteration of tables are
Ruby Hashes. They look like this:

  {'key1' => 'value1', 'key2' => 'value2', ...}

and are opened and closed with curley-braces.  Key/values are delimited by the
'=>' character combination.  Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc.  Constants do not need to be quoted.  Type
'Object.constants' to see a (messy) list of all constants in the environment.

If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:

  hbase> get 't1', "key\x03\x3f\xcd"
  hbase> get 't1', "key\003\023\011"
  hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html

3. List the tables in the current database

hbase(main):002:0> list

2.1.2 Table Operations

1. Create a table

hbase(main):002:0> create 'student','info'

2. Insert data into the table

hbase(main):003:0> put 'student','1001','info:sex','male'
hbase(main):004:0> put 'student','1001','info:age','18'
hbase(main):005:0> put 'student','1002','info:name','Janna'
hbase(main):006:0> put 'student','1002','info:sex','female'
hbase(main):007:0> put 'student','1002','info:age','20'

3. Scan the table data

hbase(main):008:0> scan 'student'
hbase(main):009:0> scan 'student',{STARTROW => '1001', STOPROW  => '1001'}
hbase(main):010:0> scan 'student',{STARTROW => '1001'}

4. View the table schema

hbase(main):011:0> describe 'student'

5. Update data in a specific field

hbase(main):012:0> put 'student','1001','info:name','Nick'
hbase(main):013:0> put 'student','1001','info:age','100'

6. View the data of a specified row or a specified "column family:column"

hbase(main):014:0> get 'student','1001'
hbase(main):015:0> get 'student','1001','info:name'

7. Count the rows in the table

hbase(main):021:0> count 'student'

8. Delete data

Delete all data for a given rowkey:

hbase(main):016:0> deleteall 'student','1001'

Delete one column of a given rowkey:

hbase(main):017:0> delete 'student','1002','info:sex'

9. Truncate the table

hbase(main):018:0> truncate 'student'

Note: truncating a table disables it first and then truncates it.

10. Drop a table

First the table has to be put into the disabled state:

hbase(main):019:0> disable 'student'

Only then can the table be dropped:

hbase(main):020:0> drop 'student'

Note: dropping the table directly while it is still enabled fails with: ERROR: Table student is enabled. Disable it first.

11. Alter table settings

Keep 3 versions of the data in the info column family:

hbase(main):022:0> alter 'student',{NAME=>'info',VERSIONS=>3}
hbase(main):022:0> get 'student','1001',{COLUMN=>'info:name',VERSIONS=>3}

 

3. HBase API

Project example

After creating the project, add the dependencies to pom.xml:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.3.1</version>
</dependency>

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.3.1</version>
</dependency>

Add log4j.properties under resources:

# Set root category priority to INFO and its only appender to CONSOLE.
#log4j.rootCategory=INFO, CONSOLE    # available levels: debug, info, warn, error, fatal
log4j.rootCategory=info, CONSOLE, LOGFILE

# Set the enterprise logger category to FATAL and its only appender to CONSOLE.
log4j.logger.org.apache.axis.enterprise=FATAL, CONSOLE

# CONSOLE is set to be a ConsoleAppender using a PatternLayout.
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} %-6r [%15.15t] %-5p %30.30c %x - %m\n

# LOGFILE is set to be a File appender using a PatternLayout.
log4j.appender.LOGFILE=org.apache.log4j.FileAppender
log4j.appender.LOGFILE.File=d:\axis.log
log4j.appender.LOGFILE.Append=true
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d{ISO8601} %-6r [%15.15t] %-5p %30.30c %x - %m\n

Implement methods for adding, deleting and querying data

package com.gcl.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HbaseClient {

    Connection connection;

    @Before
    public void before() throws IOException {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "hadoop102,hadoop103,hadoop104");
        configuration.set("hbase.zookeeper.property.clientPort","2181");

        // Obtain the Connection
        connection = ConnectionFactory.createConnection(configuration);
    }


    @After
    public void after() throws IOException {
        // Close the connection
        connection.close();
    }

    /**
     * Create a table
     * @throws IOException
     */
    @Test
    public void createTable() throws IOException {

        // Use the Connection to obtain an Admin, which talks to the Master
        Admin admin = connection.getAdmin();

        // Name of the table to create
        TableName name = TableName.valueOf("student");

        if(!admin.tableExists(name)){
            // Table descriptor
            HTableDescriptor descriptor = new HTableDescriptor(name);
            // Column family descriptor
            HColumnDescriptor colum = new HColumnDescriptor("info");
            descriptor.addFamily(colum);
            admin.createTable(descriptor);
        }

        admin.close();

    }

    /**
     * Delete a table
     * @throws IOException
     */
    @Test
    public void deleteTable() throws IOException {
        // Obtain an Admin, which talks to the Master
        Admin admin = connection.getAdmin();

        // Name of the table to delete
        TableName name = TableName.valueOf("student");

        if(admin.tableExists(name)) {
            // The table must be disabled before it can be deleted
            admin.disableTable(name);
            admin.deleteTable(name);
        } else {
            System.out.println("表不存在");
        }
        admin.close();
    }

    /**
     * Add data
     * @throws IOException
     */
    @Test
    public void addCell() throws IOException {
        // Get the table
        Table student = connection.getTable(TableName.valueOf("student"));
        // A Put object describes a single insert. Building one by hand looks like the
        // commented-out sketch below; the addData helper further down wraps the same steps.
        //   Put put = new Put("1001".getBytes());
        //   put.add(new KeyValue("1001".getBytes(), "info".getBytes(), "name".getBytes(), "zhangsan".getBytes()));
        //   student.put(put);

        // Insert data into the table
        addData(student, "1001","info","name","zhangsan");
        addData(student, "1002","info","name","lisi");
        addData(student, "1003","info","name","wangwu");
        addData(student, "1001","info","age","12");
        addData(student, "1002","info","age","13");
        addData(student, "1003","info","age","15");

        // Close the table
        student.close();
    }

    private void addData(Table table, String rowkey, String cf, String cq, String value) throws IOException {
        Put put = new Put(rowkey.getBytes());
        put.add(new KeyValue(rowkey.getBytes(),   // rowkey
                cf.getBytes(),                    // column family
                cq.getBytes(),                    // column qualifier
                value.getBytes()));               // value
        table.put(put);
    }

    /**
     * Delete data from a row
     * @throws IOException
     */
    @Test
    public void deleteCell() throws IOException {
        Table student = connection.getTable(TableName.valueOf("student"));

        // A Delete object describes the delete operation
        Delete delete = new Delete("1001".getBytes());

        // Without a column family/qualifier the whole row would be deleted; here only info:name is removed
        delete.addColumn("info".getBytes(), "name".getBytes());

        student.delete(delete);

        student.close();
    }

    /**
     * Get data from a row
     * @throws IOException
     */
    @Test
    public void getCell() throws IOException {
        Table student = connection.getTable(TableName.valueOf("student"));

        // Query the data
        Get get = new Get("1001".getBytes());

        get.addColumn("info".getBytes(), "sex".getBytes());

        Result result = student.get(get);

        Cell[] cells = result.rawCells();

        System.out.println("rowkey\tcf\tcq\tvalue");
        for (Cell cell : cells) {
            String rowkey = Bytes.toString(CellUtil.cloneRow(cell));
            String cf = Bytes.toString(CellUtil.cloneFamily(cell));
            String cq = Bytes.toString(CellUtil.cloneQualifier(cell));
            String value = Bytes.toString(CellUtil.cloneValue(cell));

            System.out.println(rowkey + "\t" + cf + "\t" + cq + "\t" + value);
        }

        student.close();
    }

    /**
     * Scan data
     * @throws IOException
     */
    @Test
    public void scan() throws IOException {
        Table student = connection.getTable(TableName.valueOf("student"));

        // A Scan object describes the scan (start row inclusive, stop row exclusive)
        Scan scan = new Scan("1001".getBytes(), "1003".getBytes());

        ResultScanner scanner = student.getScanner(scan);

        // Iterate over the results
        for (Result result : scanner) {
            System.out.println("rowkey\tcf\tcq\tvalue");
            for (Cell cell : result.rawCells()) {
                String rowkey = Bytes.toString(CellUtil.cloneRow(cell));
                String cf = Bytes.toString(CellUtil.cloneFamily(cell));
                String cq = Bytes.toString(CellUtil.cloneQualifier(cell));
                String value = Bytes.toString(CellUtil.cloneValue(cell));

                System.out.println(rowkey + "\t" + cf + "\t" + cq + "\t" + value);
            }
        }
        student.close();
    }

    /**
     * Delete multiple rows
     * @throws IOException
     */
    @Test
    public void deleteMultipleRow() throws IOException {
        Table student = connection.getTable(TableName.valueOf("student"));

        deleteMulti(student, "1002", "1003");
        student.close();
    }

    
    private void deleteMulti(Table student, String ... rows) throws IOException {
        // Build a list of Delete objects
        List<Delete> deletes = new ArrayList<Delete>();

        // Add each row to be deleted to the list
        for (String row : rows) {
            deletes.add(new Delete(row.getBytes()));
        }
        // Execute the batch delete
        student.delete(deletes);
    }

}

4. Integration with Hive

4.1 Comparing HBase and Hive

1) Hive

(1) Data warehouse

Hive is essentially a mapping, kept in MySQL (the metastore), onto files already stored in HDFS, so that they can be managed and queried conveniently with HQL.

(2) Used for data analysis and cleansing

Hive is suited to offline data analysis and cleansing and has relatively high latency.

(3) Based on HDFS and MapReduce

The data Hive stores still lives on DataNodes, and the HQL statements you write are ultimately translated into MapReduce jobs for execution.

2) HBase

(1) Database

A non-relational database with column-family-oriented storage.

(2) Stores structured and unstructured data

Suited to storing non-relational data in single tables; not suited to relational queries such as JOINs.

(3) Based on HDFS

Data is persisted as HFiles on DataNodes and is managed by RegionServers in the form of regions.

(4) Low latency, usable by online services

Even with large volumes of enterprise data, HBase can store massive amounts of data in a single table while providing fast data access.

 

4.2 Using HBase and Hive Together

The integration of HBase and Hive is not compatible in the latest versions of the two.

The hive-hbase-handler-1.2.2.jar below is needed:

Link: https://pan.baidu.com/s/1gbJ_E6re65IJPdvg1jf59Q  extraction code: rood

Environment preparation

Because later operations in Hive may also affect HBase, Hive needs to hold the jars required to operate HBase. Copy the jars Hive depends on (or create symlinks to them):

 

export HBASE_HOME=/opt/module/hbase
export HIVE_HOME=/opt/module/hive

ln -s $HBASE_HOME/lib/hbase-common-1.3.1.jar  $HIVE_HOME/lib/hbase-common-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-server-1.3.1.jar $HIVE_HOME/lib/hbase-server-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-client-1.3.1.jar $HIVE_HOME/lib/hbase-client-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-protocol-1.3.1.jar $HIVE_HOME/lib/hbase-protocol-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-it-1.3.1.jar $HIVE_HOME/lib/hbase-it-1.3.1.jar
ln -s $HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar $HIVE_HOME/lib/htrace-core-3.1.0-incubating.jar
ln -s $HBASE_HOME/lib/hbase-hadoop2-compat-1.3.1.jar $HIVE_HOME/lib/hbase-hadoop2-compat-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-hadoop-compat-1.3.1.jar $HIVE_HOME/lib/hbase-hadoop-compat-1.3.1.jar

Also set the zookeeper properties in hive-site.xml, as follows:

<property>
  <name>hive.zookeeper.quorum</name>
  <value>hadoop102,hadoop103,hadoop104</value>
  <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
  <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>

1. Case 1

Goal: create a Hive table associated with an HBase table, so that inserting data into the Hive table also affects the HBase table.

Step by step:

(1) Create a table in Hive and associate it with HBase

CREATE TABLE hive_hbase_emp_table(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");

Note: once this is done, you can check in both Hive and HBase that the corresponding tables have been created.

 

(2) Create a temporary intermediate table in Hive for loading data from files

Note: data cannot be loaded directly into the Hive table that is associated with HBase.

 

CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';

Insert data into the emp table:

0: jdbc:hive2://hadoop102:10000> insert into emp values(01,'zhansan','test',02,'2020-04-01','12.0','1200.0',001);

Insert data into hive_hbase_emp_table:

0: jdbc:hive2://hadoop102:10000> insert into table hive_hbase_emp_table select * from emp;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1585707674958_0004
INFO  : The url to track the job: http://hadoop103:8088/proxy/application_1585707674958_0004/
INFO  : Starting Job = job_1585707674958_0004, Tracking URL = http://hadoop103:8088/proxy/application_1585707674958_0004/
INFO  : Kill Command = /opt/module/hadoop-2.7.4/bin/hadoop job  -kill job_1585707674958_0004
INFO  : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO  : 2020-04-01 10:47:16,896 Stage-0 map = 0%,  reduce = 0%
INFO  : 2020-04-01 10:47:34,253 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 6.64 sec
INFO  : MapReduce Total cumulative CPU time: 6 seconds 640 msec
INFO  : Ended Job = job_1585707674958_0004
No rows affected (56.331 seconds)

The data can now be viewed in HBase:

hbase(main):002:0> scan 'hbase_emp_table'
ROW                                      COLUMN+CELL                                                                                                           
 1                                       column=info:comm, timestamp=1585709253833, value=1200.0                                                               
 1                                       column=info:deptno, timestamp=1585709253833, value=1                                                                  
 1                                       column=info:ename, timestamp=1585709253833, value=zhansan                                                             
 1                                       column=info:hiredate, timestamp=1585709253833, value=2020-04-01                                                       
 1                                       column=info:job, timestamp=1585709253833, value=test                                                                  
 1                                       column=info:mgr, timestamp=1585709253833, value=2                                                                     
 1                                       column=info:sal, timestamp=1585709253833, value=12.0                                                                  
1 row(s) in 0.1400 seconds

 

over~
