Data Import (Part 1): Hive on HBase

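Integrating Hive with HBase lets Hive take advantage of HBase's storage features, such as row-level updates and column indexing. During integration, take care to keep the HBase jar versions consistent. The integration works by having the two systems talk to each other through their public APIs, and that communication relies mainly on the hive-hbase-handler jar.
The steps for integrating Hive with HBase are as follows:
1. Copy hbase-common-0.96.2-hadoop2.jar and zookeeper-3.4.5.jar from HBASE_HOME into HIVE_HOME/lib, overwriting any existing copies.
2. Edit hive-site.xml under HIVE_HOME/conf and add the following (adjust to your environment):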

<property>
<name>hive.querylog.location</name>
<value>$HIVE_HOME/logs</value>
</property>

<property>
<name>hive.aux.jars.path</name> 
<value>file:///hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,file:///hive-0.7.1/lib/hbase-common-0.96.2-hadoop2.jar,file:///hive-0.7.1/lib/zookeeper-3.4.5.jar</value>
</property>
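
Typing the hive.aux.jars.path value by hand is error-prone, because every entry needs the file:// prefix and the entries must be comma-separated. As a minimal sketch, the helper below assembles such a value from a lib directory and a list of jar names. The function name build_aux_path is made up for this post and is not part of Hive.

```shell
# Hypothetical helper (not part of Hive): join jar file names into a
# comma-separated value suitable for hive.aux.jars.path.
# Usage: build_aux_path LIB_DIR JAR [JAR...]
build_aux_path() {
    lib_dir=$1
    shift
    value=""
    for jar in "$@"; do
        # Prefix each jar with file:// and put a comma between entries.
        value="${value:+$value,}file://$lib_dir/$jar"
    done
    printf '%s\n' "$value"
}
```

For example, build_aux_path "$HIVE_HOME/lib" followed by the three jar names from step 1 and the handler jar prints a string that can be pasted into the value element.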

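3. Copy hbase-common-0.96.2-hadoop2.jar into hadoop/lib on every Hadoop node (including the master).
4. Copy hbase-site.xml from hbase/conf into hadoop/conf on every Hadoop node (including the master).

Note: if steps 3 and 4 are skipped, running Hive will very likely fail with the following error: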
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and
then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information. at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
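
Steps 3 and 4 have to be repeated on every node, so a small loop helps. The sketch below only prints the scp commands instead of running them, so they can be reviewed first. It assumes the jar sits in HBASE_HOME/lib, that Hadoop uses lib and conf directories directly under its home, and that passwordless ssh to the nodes is set up. The function name plan_hbase_copy is hypothetical.

```shell
# Hypothetical dry-run helper: print the scp commands for steps 3 and 4.
# Assumes the jar lives in HBASE_HOME/lib and Hadoop has lib/ and conf/.
# Usage: plan_hbase_copy HBASE_HOME HADOOP_HOME NODE [NODE...]
plan_hbase_copy() {
    hbase_home=$1
    hadoop_home=$2
    shift 2
    for node in "$@"; do
        # Step 3: the HBase jar goes into hadoop/lib on the node.
        echo "scp $hbase_home/lib/hbase-common-0.96.2-hadoop2.jar $node:$hadoop_home/lib/"
        # Step 4: hbase-site.xml goes into hadoop/conf on the node.
        echo "scp $hbase_home/conf/hbase-site.xml $node:$hadoop_home/conf/"
    done
}
```

Once the printed commands look right, piping the output to sh runs them.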

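5. Start Hive
Single-node startup: bin/hive -hiveconf hbase.master=master:60000
If hive.aux.jars.path is not configured in hive-site.xml, Hive can be started as follows instead (adjust the paths and jar versions to match your installation).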
hive --auxpath /opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,/opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,/opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar -hiveconf hbase.master=localhost:60000

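Cluster startup: bin/hive -hiveconf hbase.zookeeper.quorum=node1,node2,node3 (list all ZooKeeper nodes)
In testing, adding the following to Hive's hive-site.xml made it possible to start Hive against HBase without the extra parameters: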

<property>
<name>hive.zookeeper.quorum</name>
<value>node1,node2,node3</value>
<description>The list of zookeeper servers to talk to. This is only needed for read/write locks.</description>
</property>

6. Test after startup
(1). Create the HBase table hbase_student

hbase> create 'hbase_student', 'info'

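(2). Create the Hive external table hive_student and map it to the hbase_student table

Integrating Hive with HBase requires a mapping between the Hive table and the HBase table: the Hive table's columns and column types are associated with the HBase table's column families and column qualifiers.
Every field in the Hive table must exist in HBase, but the Hive table does not need to include every HBase column.
The HBase row key is mapped to a Hive field of your choice using :key, and a column within a column family is written as cf:q in the mapping.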

CREATE EXTERNAL TABLE hive_student (rowkey string, name string, age int, phone string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age,info:phone")
    TBLPROPERTIES("hbase.table.name" = "hbase_student"); 
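
The entries in hbase.columns.mapping line up with the Hive columns by position, so a miscounted entry is an easy mistake to make. As a rough sketch, the helper below prints the pairing for two comma-separated lists and fails when the counts differ. The function name show_column_mapping is invented for illustration, and the check is purely textual: it does not talk to Hive or HBase.

```shell
# Hypothetical helper: pair Hive columns with hbase.columns.mapping entries
# by position. Purely textual, it does not contact Hive or HBase.
# Usage: show_column_mapping HIVE_COLUMNS HBASE_MAPPING
show_column_mapping() {
    awk -v cols="$1" -v map="$2" 'BEGIN {
        n = split(cols, c, ",")
        m = split(map, h, ",")
        if (n != m) {
            # One mapping entry is needed per Hive column.
            print "column count mismatch: " n " vs " m > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= n; i++) print c[i] " -> " h[i]
    }'
}
```

Calling it with 'rowkey,name,age,phone' and ':key,info:name,info:age,info:phone' lists four pairs, starting with rowkey -> :key.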

7. Data import and verification:
(1). Create the external data table data_student

CREATE EXTERNAL TABLE data_student (rowkey string, name string, age int, phone string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
  LOCATION '/test/hbase/tsv/input/'; 
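
To try the import without real data, a small tab-separated file in the layout data_student expects is enough. The sketch below writes three made-up rows (rowkey, name, age, phone). The values and the function name write_sample_tsv are placeholders for illustration only.

```shell
# Hypothetical helper: write a few made-up rows in the layout that
# data_student expects (rowkey, name, age, phone, separated by tabs).
# Usage: write_sample_tsv OUTPUT_FILE
write_sample_tsv() {
    out_file=$1
    {
        printf '%s\t%s\t%s\t%s\n' 1001 alice 20 555-0100
        printf '%s\t%s\t%s\t%s\n' 1002 bob 22 555-0101
        printf '%s\t%s\t%s\t%s\n' 1003 carol 21 555-0102
    } > "$out_file"
}
```

The resulting file can then be uploaded to the table location, for example with hadoop fs -put students.tsv /test/hbase/tsv/input/.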

(2). Load the data into hbase_student through hive_student

SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE hive_student SELECT rowkey, name, age, phone FROM data_student;
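
After the insert finishes, a quick look from both sides confirms that the rows arrived. The sketch below wraps the two checks in small functions. It assumes the hive and hbase launchers are on PATH, and the function names are made up for this post.

```shell
# Hypothetical helpers for a quick look at both sides after the insert.
# They assume the hive and hbase launchers are on PATH.

# Read a few rows back through the Hive table (default: 5 rows).
peek_hive_rows() {
    limit=${1:-5}
    hive -e "SELECT * FROM hive_student LIMIT $limit;"
}

# Scan a few rows directly in HBase by piping a command to the HBase shell.
peek_hbase_rows() {
    limit=${1:-5}
    echo "scan 'hbase_student', {LIMIT => $limit}" | hbase shell
}
```

If both commands show the same row keys, the mapping and the import are working.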

Note: if you hit java.lang.IllegalArgumentException: Property value must not be null, Hive 0.13.0 or later is required.

posted @ 2015-09-30 14:37  skyl夜