HBase 数据迁移实战案例

              HBase 数据迁移实战案例

                                     作者:尹正杰

版权声明:原创作品,谢绝转载!否则将追究法律责任。

 

 

  通过HBase的相关JavaAPI,我们可以实现伴随HBase操作的MapReduce过程,比如使用MapReduce将数据从本地文件系统导入到HBase的表中,比如我们从HBase中读取一些原始数据后使用MapReduce做数据分析。

 

一.官方HBase-MapReduce案例 

  博主推荐阅读:
    http://hbase.apache.org/2.2/book.html#mapreduce

1>.统计"yinzhengjie2020:teacher"表中有多少行数据(需要同时启动Hadoop,Hbase集群)

[root@hadoop101.yinzhengjie.org.cn ~]# yarn jar /yinzhengjie/softwares/hbase/lib/hbase-server-2.2.4.jar org.apache.hadoop.hbase.mapreduce.RowCounter yinzhengjie2020:teacher

2>.使用MapReduce将本地数据导入到HBase

[root@hadoop101.yinzhengjie.org.cn ~]# vim fruit.tsv
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat fruit.tsv
1001    Apple    Red
1002    Pear    Yellow
1003    Pineapple    Yellow
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# vim fruit.tsv                  #创建测试文件
[root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -mkdir /input_fruit/
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# ll
total 4
-rw-r--r-- 1 root root 54 May 29 18:50 fruit.tsv
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -put fruit.tsv /input_fruit/
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -ls /input_fruit/
Found 1 items
-rw-r--r--   3 root supergroup         54 2020-05-29 18:55 /input_fruit/fruit.tsv
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# hdfs dfs -put fruit.tsv /input_fruit/    #将测试文件上传到HDFS上
[root@hadoop101.yinzhengjie.org.cn ~]# yarn jar /yinzhengjie/softwares/hbase/lib/hbase-server-2.2.4.jar org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color fruit hdfs://hadoop106.yinzhengjie.org.cn:9000/input_fruit

 

二.自定义HBase-MapReduce案例

1>.查看Hadoop的mapreduce执行任务所需要HBase的依赖包(也可以说是类路径)

[root@hadoop101.yinzhengjie.org.cn ~]# hbase mapredcp
/yinzhengjie/softwares/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.2.4.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/audience-ann
otations-0.5.0.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/log4j-1.2.17.jar:/yinzhengjie/softwares/hbase/bin/../lib/client-facing-thirdparty/slf4j-api-1.7.25.jar[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# hbase mapredcp

 

2>.配置环境变量

[root@hadoop101.yinzhengjie.org.cn ~]# cat /etc/profile.d/hbase.sh 
#Add ${HBASE_HOME} by yinzhengjie
HBASE_HOME=/yinzhengjie/softwares/hbase
PATH=$PATH:${HBASE_HOME}/bin
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat /etc/profile.d/hbase.sh
[root@hadoop101.yinzhengjie.org.cn ~]# cat /yinzhengjie/softwares/ha/etc/hadoop/hadoop-env.sh  
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/yinzhengjie/softwares/jdk1.8.0_201

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

#Add by yinzhengjie
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/yinzhengjie/softwares/hbase/lib/*

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Enable extra debugging of Hadoop's JAAS binding, used to set up
# Kerberos security.
# export HADOOP_JAAS_DEBUG=true

# Extra Java runtime options.  Empty by default.
# For Kerberos debugging, an extended option set logs more invormation
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS"
# set heap args when HADOOP_HEAPSIZE is empty
if [ "$HADOOP_HEAPSIZE" = "" ]; then
  export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
fi
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
#export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Router-based HDFS Federation specific parameters
# Specify the JVM options to be used when starting the RBF Routers.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_DFSROUTER_OPTS=""
###

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
#export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_PID_DIR=/yinzhengjie/softwares/hadoop-2.10.0/pid
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat /yinzhengjie/softwares/ha/etc/hadoop/hadoop-env.sh

3>.查看原表数据

hbase(main):001:0> list
TABLE                                                                                                                                                                                                                                                                         
engineer                                                                                                                                                                                                                                                                      
teacher                                                                                                                                                                                                                                                                       
yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
3 row(s)
Took 0.3388 seconds                                                                                                                                                                                                                                                           
=> ["engineer", "teacher", "yinzhengjie2020:teacher"]
hbase(main):002:0>
hbase(main):001:0> list
hbase(main):002:0> describe 'yinzhengjie2020:teacher'
Table yinzhengjie2020:teacher is ENABLED                                                                                                                                                                                                                                      
yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'synopsis', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMF
ILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                               

1 row(s)

QUOTAS                                                                                                                                                                                                                                                                        
0 row(s)
Took 0.2182 seconds                                                                                                                                                                                                                                                           
hbase(main):003:0>
hbase(main):002:0> describe 'yinzhengjie2020:teacher'            #查看"yinzhengjie2020:teacher"表的结构
hbase(main):003:0> scan 'yinzhengjie2020:teacher'
ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
 100001                                                              column=synopsis:name, timestamp=1590715486602, value=JasonYin                                                                                                                                            
 10010                                                               column=synopsis:name, timestamp=1590726110051, value=\xE5\xB0\xB9\xE6\xAD\xA3\xE6\x9D\xB0                                                                                                                
2 row(s)
Took 0.0478 seconds                                                                                                                                                                                                                                                           
hbase(main):004:0> 
hbase(main):003:0> scan 'yinzhengjie2020:teacher'

4>.新建目标表

hbase(main):005:0> create 'yinzhengjie2020:student','synopsis'
Created table yinzhengjie2020:student
Took 1.3685 seconds                                                                                                                                                                                                                                                           
=> Hbase::Table - yinzhengjie2020:student
hbase(main):006:0>
hbase(main):005:0> create 'yinzhengjie2020:student','synopsis'
hbase(main):006:0> list
TABLE                                                                                                                                                                                                                                                                         
engineer                                                                                                                                                                                                                                                                      
teacher                                                                                                                                                                                                                                                                       
yinzhengjie2020:student                                                                                                                                                                                                                                                       
yinzhengjie2020:teacher                                                                                                                                                                                                                                                       
4 row(s)
Took 0.0091 seconds                                                                                                                                                                                                                                                           
=> ["engineer", "teacher", "yinzhengjie2020:student", "yinzhengjie2020:teacher"]
hbase(main):007:0> 
hbase(main):006:0> list
hbase(main):007:0> describe 'yinzhengjie2020:student'
Table yinzhengjie2020:student is ENABLED                                                                                                                                                                                                                                      
yinzhengjie2020:student                                                                                                                                                                                                                                                       
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'synopsis', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMF
ILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                               

1 row(s)

QUOTAS                                                                                                                                                                                                                                                                        
0 row(s)
Took 0.0651 seconds                                                                                                                                                                                                                                                           
hbase(main):008:0> 
hbase(main):007:0> describe 'yinzhengjie2020:student'            #创建另外一张表其结构与"yinzhengjie2020:teacher"一样
hbase(main):008:0> scan 'yinzhengjie2020:student'
ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
0 row(s)
Took 0.0765 seconds                                                                                                                                                                                                                                                           
hbase(main):009:0> 
hbase(main):008:0> scan 'yinzhengjie2020:student'

5>.编写MapReduce程序实现将"yinzhengjie2020:teacher"表中的数据迁移至"yinzhengjie2020:student"表中

 

posted @ 2020-02-07 05:59  JasonYin2020  阅读(282)  评论(0编辑  收藏  举报