Reading and Writing HBase Data with Spark

Installing HBase

An installation guide is available in the HBase learning series of posts.

Creating an HBase Table

Start Hadoop and HBase

# start Hadoop (run from the Hadoop installation directory)
./sbin/start-dfs.sh
# start HBase and open the HBase shell (run from the HBase installation directory)
./bin/start-hbase.sh
./bin/hbase shell
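
To confirm that Hadoop and HBase started successfully, you can list the running Java daemons with jps (a quick sanity check; the exact set of processes depends on how HBase is deployed):

jps
# in a pseudo-distributed setup you should see the Hadoop daemons (NameNode, DataNode,
# SecondaryNameNode) together with HMaster, HRegionServer and HQuorumPeer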

Create the table

# drop the student table first if it already exists
disable 'student'
drop 'student'

# create table 'student' with column family 'info'
# (column qualifiers: name, gender, age)
create 'student','info'

Insert data

# insert data; '1' is the row key
put 'student','1','info:name','Xueqian'
put 'student','1','info:gender','F'
put 'student','1','info:age','23'

put 'student','2','info:name','Weiliang'
put 'student','2','info:gender','M'
put 'student','2','info:age','24'
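
To verify that the two rows were inserted, you can scan the table from the HBase shell (sample output abridged; timestamps omitted):

scan 'student'
# ROW   COLUMN+CELL
# 1     column=info:age, value=23
# 1     column=info:gender, value=F
# 1     column=info:name, value=Xueqian
# 2     column=info:age, value=24
# ...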

Spark Configuration

Download the jar files

Copy the required jar files from hbase/lib into a hbase subdirectory under spark/jars (this directory is referenced by the classpath setting below).

The jar files to copy are: hbase*.jar, guava-12.0.1.jar, htrace-core-3.1.0-incubating.jar, protobuf-java-2.5.0.jar

You also need to download spark-examples_2.11-1.6.0-typesafe-001.jar (which provides the Python converter classes used below) from https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001 and save it in spark/jars.
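
A sketch of the copy steps, assuming HBase is installed at /usr/local/hbase and Spark at /usr/local/spark (the same paths used in spark-env.sh below), and that the examples jar was downloaded to ~/Downloads:

cd /usr/local/spark/jars
mkdir hbase
cp /usr/local/hbase/lib/hbase*.jar ./hbase/
cp /usr/local/hbase/lib/guava-12.0.1.jar ./hbase/
cp /usr/local/hbase/lib/htrace-core-3.1.0-incubating.jar ./hbase/
cp /usr/local/hbase/lib/protobuf-java-2.5.0.jar ./hbase/
# the spark-examples jar (with the Python converter classes) goes directly into spark/jars
cp ~/Downloads/spark-examples_2.11-1.6.0-typesafe-001.jar /usr/local/spark/jars/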

Configure the spark-env.sh file
cd /usr/local/spark/conf
sudo gedit spark-env.sh

# add the following line
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/jars/hbase/*

Writing Programs to Read and Write HBase Data

Reading data

Use the newAPIHadoopRDD API provided by SparkContext to load the table contents into Spark as an RDD.

  • Code to read HBase data (SparkOperateHBase.py)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("ReadHBase")
sc = SparkContext(conf=conf)
host = 'localhost'
table = 'student'
# HBase connection settings: ZooKeeper quorum and the table to read
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
# converters that turn the HBase key and Result types into plain strings
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
# load the table contents as an RDD of (row key, cell data) pairs
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
hbase_rdd.cache()
count = hbase_rdd.count()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
  • Submit the code with spark-submit
/usr/local/spark/bin/spark-submit SparkOperateHBase.py
Writing data
  • Code to write HBase data
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("WriteHBase")
sc = SparkContext(conf=conf)
host = "localhost"
table = "student"
# converters that turn Python strings into the HBase key and Put types
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
# output configuration: ZooKeeper quorum, target table, output format and key/value classes
conf = {"hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
# each record is "rowkey,columnFamily,qualifier,value"
rawData = ['3,info,name,Rongcheng', '3,info,gender,M', '3,info,age,26',
           '4,info,name,Guanhua', '4,info,gender,M', '4,info,age,27']
# map each record to (rowkey, [rowkey, family, qualifier, value]) and write it to HBase
sc.parallelize(rawData).map(lambda x: (x[0], x.split(','))) \
    .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
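
Submit the write program the same way as the read program (the script name SparkWriteHBase.py is assumed here), then re-scan the table in the HBase shell to confirm that rows 3 and 4 were added:

/usr/local/spark/bin/spark-submit SparkWriteHBase.py
# verify the result in the HBase shell
scan 'student'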