Hive on Spark

I. Versions used

Note: Hive on Spark is strict about version compatibility. The versions listed below have been verified to work together.

a) apache-hive-2.3.2-bin.tar.gz

b) hadoop-2.7.2.tar.gz

c) jdk-8u144-linux-x64.tar.gz

d) mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar

e) mysql-connector-java-5.1.43-bin.jar

f) spark-2.0.0.tgz (the Spark source package; it must be built from source)

g) Red Hat Linux 7.4, 64-bit

II. Install Linux and the JDK, and disable the firewall

 

III. Install and configure the MySQL database

a) Unpack the MySQL installation bundle

b) Install MySQL

yum remove mysql-libs

rpm -ivh mysql-community-common-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-libs-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-client-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-server-5.7.19-1.el7.x86_64.rpm

rpm -ivh mysql-community-devel-5.7.19-1.el7.x86_64.rpm  (optional)

 

c) Start MySQL

systemctl start mysqld.service

 

d) Look up and change the root user's password

Find the temporary root password: cat /var/log/mysqld.log | grep password

Log in and change the password: alter user 'root'@'localhost' identified by 'Welcome_1';
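A minimal sketch of the full sequence (the temporary password is whatever value the grep above printed; enter it at the prompt):

  mysql -uroot -p
  mysql> alter user 'root'@'localhost' identified by 'Welcome_1';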
 

e) Create the hive database and the hiveowner user:

  • Create a new database: create database hive;
  • Create a new user:
    create user 'hiveowner'@'%' identified by 'Welcome_1';
  • Grant that user privileges on the hive database (a quick verification follows the list):

    grant all on hive.* TO 'hiveowner'@'%';

    grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
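To confirm the new account works before moving on (a quick check, not part of the original steps):

  mysql -uhiveowner -pWelcome_1 -e "show databases;"

  The output should include the hive database.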

IV. Install Hadoop (using pseudo-distributed mode as the example)

Hive on Spark uses the Spark on Yarn deployment mode by default, so Hadoop has to be set up first.

a) Preparation:

  1. Configure the hostname (edit /etc/hosts)
  2. Configure passwordless SSH login (see the sketch below)
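A minimal sketch of these two steps, assuming the machine is named hive77 (the IP address below is only a placeholder; use the real one):

  # /etc/hosts: map the hostname used throughout the configs to this machine's IP
  192.168.1.77   hive77

  # passwordless SSH to the local node (enough for pseudo-distributed mode)
  ssh-keygen -t rsa
  ssh-copy-id root@hive77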

b) Edit the following Hadoop configuration files:

hadoop-env.sh
    JAVA_HOME                               /root/training/jdk1.8.0_144

hdfs-site.xml
    dfs.replication                         1        (block replication factor; the default is 3)
    dfs.permissions                         false    (whether HDFS permission checking is enabled)

core-site.xml
    fs.defaultFS                            hdfs://hive77:9000                  (NameNode address)
    hadoop.tmp.dir                          /root/training/hadoop-2.7.2/tmp/    (directory where HDFS data is kept)

mapred-site.xml
    mapreduce.framework.name                yarn

yarn-site.xml
    yarn.resourcemanager.hostname           hive77
    yarn.nodemanager.aux-services           mapreduce_shuffle
    yarn.resourcemanager.scheduler.class    org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

For Spark on Yarn, the fair scheduler is used so that every job in the Yarn cluster gets an equal share of resources to run with.
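For reference, each row of the table above is written into the corresponding XML file as a <property> block; for example, the fair-scheduler entry in yarn-site.xml looks like this (a minimal sketch):

<property>
   <name>yarn.resourcemanager.scheduler.class</name>
   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>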

 

c) Start Hadoop

d) Use the Yarn web console to confirm that the fair scheduler is in effect
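A minimal sketch of these two steps, assuming Hadoop's sbin directory is on the PATH (8088 is the default ResourceManager web port):

  start-dfs.sh        # starts NameNode, SecondaryNameNode and DataNode
  start-yarn.sh       # starts ResourceManager and NodeManager
  jps                 # confirm the daemons are running

  Then open http://hive77:8088 and check that the Scheduler page reports the Fair Scheduler.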

V. Build Spark from source

(Maven is required; the Spark source package ships with its own copy of Maven)

a) Run the following command to build (it takes a long time; be patient):

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

 

b) A successful build produces spark-2.0.0-bin-hadoop2-without-hive.tgz

c) Install and configure Spark

  1. Unpack the tarball and check the directory structure.

  2. Add the following settings to spark-env.sh:

export JAVA_HOME=/root/training/jdk1.8.0_144

export HADOOP_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export YARN_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop

export SPARK_MASTER_HOST=hive77

export SPARK_MASTER_PORT=7077

export SPARK_EXECUTOR_MEMORY=512m

export SPARK_DRIVER_MEMORY=512m

export SPARK_WORKER_MEMORY=512m

  3. Copy the Hadoop jars into Spark's jars directory, as follows:

cp ~/training/hadoop-2.7.2/share/hadoop/common/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/common/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/lib/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/*.jar jars/

cp ~/training/hadoop-2.7.2/share/hadoop/yarn/lib/*.jar jars/

  4. Create a directory named spark-jars on HDFS and upload the jars from Spark's jars directory to it, so these jars do not have to be shipped out every time an application runs (a related configuration note follows the commands):

  • hdfs dfs -mkdir /spark-jars
  • hdfs dfs -put jars/*.jar /spark-jars
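For Spark on Yarn to actually read the jars from HDFS rather than uploading them on every run, the spark.yarn.jars property normally has to point at that directory. This setting is an assumption, not something spelled out in the original steps; it would go, for example, into conf/spark-defaults.conf:

  spark.yarn.jars    hdfs://hive77:9000/spark-jars/*.jar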

d) Start Spark with sbin/start-all.sh and verify that it is configured correctly (see the quick check below)
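A quick way to check (7077 is the master port configured above; 8080 is the default Spark master web UI port):

  jps                 # should now also show Master and Worker processes

  Then open http://hive77:8080 to see the Spark master web UI.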
 

VI. Install and configure Hive

a) Unpack the Hive installation package and place the MySQL JDBC driver (mysql-connector-java-5.1.43-bin.jar) into Hive's lib directory.

 

b) Set Hive's environment variables

HIVE_HOME=/root/training/apache-hive-2.3.2-bin

export HIVE_HOME

PATH=$HIVE_HOME/bin:$PATH

export PATH

c) Copy the following Spark jars into Hive's lib directory (see the sketch after this list):

  1. scala-library
  2. spark-core
  3. spark-network-common
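A minimal sketch of the copy, assuming the jar names produced by this Spark 2.0.0 build (the exact version suffixes may differ, hence the wildcards):

  cd /root/training/apache-hive-2.3.2-bin
  cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/scala-library-*.jar lib/
  cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/spark-core_*.jar lib/
  cp /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars/spark-network-common_*.jar lib/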

d) Create the directory /sparkeventlog on HDFS to hold Spark event logs

  hdfs dfs -mkdir /sparkeventlog

 

e) Configure hive-site.xml as follows:

Parameter                                   Reference value

javax.jdo.option.ConnectionURL              jdbc:mysql://localhost:3306/hive?useSSL=false
javax.jdo.option.ConnectionDriverName       com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName         hiveowner
javax.jdo.option.ConnectionPassword         Welcome_1
hive.execution.engine                       spark
hive.enable.spark.execution.engine          true
spark.home                                  /root/training/spark-2.0.0-bin-hadoop2-without-hive
spark.master                                yarn-client
spark.eventLog.enabled                      true
spark.eventLog.dir                          hdfs://hive77:9000/sparkeventlog
spark.serializer                            org.apache.spark.serializer.KryoSerializer
spark.executor.memory                       512m
spark.driver.memory                         512m
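Each row goes into hive-site.xml as a <property> block inside <configuration>; for example, the first two entries (the remaining parameters follow the same pattern):

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
</property>

<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>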

 

f) Initialize the metastore schema in MySQL: schematool -dbType mysql -initSchema

g) Start the hive shell and create an employee table (emp1) to hold the employee data (a possible DDL sketch follows)
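A minimal sketch of the table, assuming emp.csv uses the classic EMP layout (empno, ename, job, mgr, hiredate, sal, comm, deptno); the actual column list should match the CSV file:

  create table emp1
  (empno int, ename string, job string, mgr int, hiredate string, sal int, comm int, deptno int)
  row format delimited fields terminated by ',';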
 

h) Load the emp.csv file:

  load data local inpath '/root/temp/emp.csv' into table emp1;

 

i) Run a query that sorts the employees by salary (this run fails):

  select * from emp1 order by sal;

j) Check the Yarn web console

The failure comes from the way Yarn accounts for virtual memory. It can be avoided by setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml, which disables the virtual-memory check:

<property>

   <name>yarn.nodemanager.vmem-check-enabled</name>

   <value>false</value>

</property>

 

 

k) Restart Hadoop, Spark, and Hive, then run the query again

 

 

One final note: because Spark on Yarn is configured, the standalone Spark cluster does not need to be started when running Hive; at that point everything is managed by Yarn.
