
Hadoop 2.8.4 cluster installation (Hive on Spark)

2018-08-15 14:38  ljinch

 

1      Overview

This guide was put together to avoid repeating the same pitfalls in future setups. (JDK installation is omitted here.)

 

Spark has to be compiled from source; a pre-built version can be downloaded from: https://download.csdn.net/download/ljinch/10651973

Hive version: 3.0

Hadoop version: 2.8.4

2      Server information

master1: 192.168.0.220
master2: 192.168.0.221
slave1:  192.168.0.222

3      Passwordless SSH setup

3.1    Configure hosts

vi /etc/hosts and add the following:

192.168.0.220 master1

192.168.0.221 master2

192.168.0.222 slave1

 

3.2    Generate keys

Run the following command on every server:

ssh-keygen -t rsa -P ''

This creates the public/private key pair (id_rsa, id_rsa.pub) under /root/.ssh. (Screenshot omitted.)

Exchange the public keys:

cat id_rsa.pub>>authorized_keys

In the end, the authorized_keys file on every server must contain the public keys of all servers. (Screenshot omitted.)

Passwordless SSH setup is now complete.
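The cat command above only appends the local key; each server's id_rsa.pub must also reach the other servers (e.g. fetched with scp, or via ssh-copy-id). A local sketch of the merge step, with illustrative file names that are not from the article:

```shell
# Merge several id_rsa.pub files into one deduplicated authorized_keys.
# Usage: merge_keys OUT_FILE PUBKEY_FILE...
merge_keys() {
  local out="$1"; shift
  cat "$@" | sort -u > "$out"       # deduplicate identical key lines
  chmod 600 "$out"                  # sshd rejects overly permissive files
}

# Example, assuming the three .pub files were copied over beforehand:
# merge_keys /root/.ssh/authorized_keys master1.pub master2.pub slave1.pub
```

The resulting authorized_keys is then distributed back to all three servers.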

4      Environment variables

All software is installed under /usr/local by default. Configure the environment variables for zookeeper/java/scala/hive/spark etc. up front:

vi /etc/profile


#Java Config
export JAVA_HOME=/usr/local/java/jdk1.8
export JRE_HOME=/usr/local/java/jdk1.8/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
#maven
export MAVEN_HOME=/usr/local/maven/maven3.5
# Zookeeper Config
export ZK_HOME=/usr/local/zookeeper
# HBase Config
export HBASE_HOME=/usr/local/hbase/hbase1.3
# Hadoop Config
export HADOOP_HOME=/usr/local/hadoop/hadoop2.8
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#hive
export HIVE_HOME=/usr/local/hive/hive3.0
export HIVE_CONF_DIR=/usr/local/hive/hive3.0/conf

#scala
export SCALA_HOME=/usr/local/scala/scala2.11

#spark
export  SPARK_HOME=/usr/local/spark/spark2.3-without-hive

export PATH=.:${JAVA_HOME}/bin:${MAVEN_HOME}/bin:${SPARK_HOME}/bin:${HADOOP_HOME}/bin:${SCALA_HOME}/bin:${HIVE_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${HBASE_HOME}/bin:$PATH
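After editing, apply the file with `source /etc/profile`. As a quick sanity check, the sketch below (the function name and usage are my own, not from the article) verifies that each *_HOME variable points at a directory that actually exists:

```shell
# Check that each VAR=PATH pair points to an existing directory.
# Prints every missing one and returns non-zero if any is absent.
check_homes() {
  local missing=0 pair
  for pair in "$@"; do
    local name="${pair%%=*}" dir="${pair#*=}"
    if [ ! -d "$dir" ]; then
      echo "MISSING $name: $dir"
      missing=1
    fi
  done
  return $missing
}

# Example, using the paths from /etc/profile above:
# check_homes JAVA_HOME=/usr/local/java/jdk1.8 \
#             SPARK_HOME=/usr/local/spark/spark2.3-without-hive
```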

5      ZooKeeper cluster configuration

Download: http://mirrors.cnnic.cn/apache/zookeeper

Version in use: zookeeper-3.4.13.tar.gz

tar -zxvf zookeeper-3.4.13.tar.gz

mv zookeeper-3.4.13 zookeeper

mv zookeeper /usr/local/

5.1    Edit the ZooKeeper configuration file

Open the config directory:

   cd /usr/local/zookeeper/conf 

   mv zoo_sample.cfg zoo.cfg

vi zoo.cfg

Edit the file:

Comment out: #dataDir=/tmp/zookeeper

Add the following:

dataDir=/usr/local/zookeeper/data
server.1=192.168.0.220:2888:3888
server.2=0.0.0.0:2888:3888
server.3=192.168.0.222:2888:3888

Repeat this on all three servers. Note: each machine's own entry uses 0.0.0.0 (the block above, with server.2 as 0.0.0.0, is what master2's file looks like).
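The per-machine 0.0.0.0 substitution is easy to get wrong by hand; a sketch of scripting it (the function is my own, using the IPs from the server table):

```shell
# Print zoo.cfg server.N entries, replacing the local machine's IP
# with 0.0.0.0 as required by this setup.
# Usage: zk_server_lines LOCAL_IP IP1 IP2 IP3 ...
zk_server_lines() {
  local self="$1"; shift
  local i=1 ip
  for ip in "$@"; do
    if [ "$ip" = "$self" ]; then
      echo "server.$i=0.0.0.0:2888:3888"
    else
      echo "server.$i=$ip:2888:3888"
    fi
    i=$((i + 1))
  done
}

# On master2 (192.168.0.221):
# zk_server_lines 192.168.0.221 192.168.0.220 192.168.0.221 192.168.0.222 >> zoo.cfg
```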

5.2    Create the myid file

Create the data directory and the myid file:

mkdir /usr/local/zookeeper/data

cd /usr/local/zookeeper/data

vi myid

Write 1, 2, 3 into myid on the three machines respectively.
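The steps above can be sketched as a small helper (run locally on each machine; the function name is my own):

```shell
# Write a ZooKeeper myid file under the given data directory.
# Usage: write_myid DATA_DIR ID
write_myid() {
  mkdir -p "$1"        # create the data directory if needed
  echo "$2" > "$1/myid"
}

# On master1: write_myid /usr/local/zookeeper/data 1
# On master2: write_myid /usr/local/zookeeper/data 2
# On slave1:  write_myid /usr/local/zookeeper/data 3
```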

5.3    Start the service

Run the start command on all three machines:

cd /usr/local/zookeeper

 ./bin/zkServer.sh start

Once started, the status can be checked with:

./bin/zkServer.sh status

 

6      Hadoop installation

Hadoop download address:

Version in use: hadoop-2.8.4.tar.gz

tar -zxvf hadoop-2.8.4.tar.gz

mkdir /usr/local/hadoop

mv hadoop-2.8.4 hadoop2.8

mv hadoop2.8 /usr/local/hadoop

cd /usr/local/hadoop/hadoop2.8/etc/hadoop

6.1    Configuration

6.1.1   vi hadoop-env.sh

Set the JDK installation path (assuming the JDK is already installed at /usr/local/java/jdk1.8):

export JAVA_HOME=/usr/local/java/jdk1.8

   

6.1.2   vi core-site.xml

<configuration>

  <!-- HDFS address; in HA mode this points to the nameservice -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns1</value>
  </property>

  <!-- Default parent directory for data stored by the NameNode, DataNode, JournalNode, etc.; each can also be set individually -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop2.8/tmp</value>
  </property>

  <!-- ZooKeeper quorum addresses and ports. The number of nodes must be odd and at least three -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>master1:2181,master2:2181,slave1:2181</value>
  </property>

</configuration>

6.1.3   vi hdfs-site.xml

<configuration>

  <!-- Replication factor; must not exceed the number of datanodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

  <!-- Logical service name for the namenode cluster -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1</value>
  </property>

  <!-- Namenodes that make up the nameservice, and their names -->
  <property>
    <name>dfs.ha.namenodes.ns1</name>
    <value>master1,master2</value>
  </property>

  <!-- RPC address and port of namenode master1; RPC is used to talk to the datanodes -->
  <property>
    <name>dfs.namenode.rpc-address.ns1.master1</name>
    <value>master1:9000</value>
  </property>

  <!-- RPC address and port of namenode master2 -->
  <property>
    <name>dfs.namenode.rpc-address.ns1.master2</name>
    <value>master2:9000</value>
  </property>

  <!-- HTTP address and port of namenode master1, for web clients -->
  <property>
    <name>dfs.namenode.http-address.ns1.master1</name>
    <value>master1:50070</value>
  </property>

  <!-- HTTP address and port of namenode master2, for web clients -->
  <property>
    <name>dfs.namenode.http-address.ns1.master2</name>
    <value>master2:50070</value>
  </property>

  <!-- JournalNodes the namenodes use to share edit logs -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://master1:8485;master2:8485;slave1:8485/ns1</value>
  </property>

  <!-- Automatically fail over to the other namenode on failure -->
  <property>
    <name>dfs.ha.automatic-failover.enabled.ns1</name>
    <value>true</value>
  </property>

  <!-- Directory where journalnodes store the edit logs -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/usr/local/hadoop/hadoop2.8/tmp/data/dfs/journalnode</value>
  </property>

  <!-- Proxy class clients use to reach the active NameNode -->
  <property>
    <name>dfs.client.failover.proxy.provider.ns1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

  <!-- Fence the old NameNode over ssh during failover -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>

  <!-- Private key used for the ssh fencing connection -->
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
  </property>

  <!-- ssh connect timeout in milliseconds -->
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>30000</value>
  </property>

</configuration>


6.1.4    vi mapred-site.xml

<configuration>

  <!-- Use YARN as the MapReduce resource scheduling framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

</configuration>

6.1.5    vi yarn-site.xml

<configuration>

  <!-- Enable ResourceManager HA -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>

  <!-- Cluster id for the ResourceManagers -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yrc</value>
  </property>

  <!-- Two ResourceManagers are used; give each an id -->
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>

  <!-- Host of rm1 -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1</value>
  </property>

  <!-- Host of rm2 -->
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2</value>
  </property>

  <!-- This machine (master1) acts as rm1 -->
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>

  <!-- ZooKeeper quorum -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>master1:2181,master2:2181,slave1:2181</value>
  </property>

  <!-- Auxiliary service on the NodeManagers; mapreduce_shuffle by default -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!-- Scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>

  <!-- Resource limits -->

  <!-- Minimum vcores per container request on the RM -->
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>

  <!-- Maximum vcores per container request on the RM -->
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>1</value>
  </property>

  <!-- vcores available for containers on each NodeManager -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>1</value>
  </property>

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8096</value>
  </property>

  <!-- Disable physical/virtual memory checks on containers -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

</configuration>

 

6.1.6    vi slaves

master1

master2

slave1

6.2    Copy the Hadoop config to the other machines

scp -r /usr/local/hadoop/hadoop2.8/etc/hadoop/* root@slave1:/usr/local/hadoop/hadoop2.8/etc/hadoop/

 

scp -r /usr/local/hadoop/hadoop2.8/etc/hadoop/* root@master2:/usr/local/hadoop/hadoop2.8/etc/hadoop/

 

 

6.3    Adjust yarn-site.xml on the other two machines

On the standby master master2 (the standby ResourceManager node), change this property to mark the machine as rm2:

<property>

    <name>yarn.resourcemanager.ha.id</name>

    <value>rm2</value>

  </property>

Also delete this property from slave1's yarn-site.xml, since slave1 does not run a ResourceManager.

 

6.4    Start Hadoop

6.4.1   Start the JournalNodes

Start the JournalNode on every node:

cd /usr/local/hadoop/hadoop2.8/sbin

./hadoop-daemon.sh start journalnode

 

6.4.2   Format the NameNode and ZKFC

On master1, run:

cd /usr/local/hadoop/hadoop2.8/bin

./hdfs namenode -format

./hdfs zkfc -formatZK

 

6.4.3   Sync the standby master with the primary

On master2, run the sync:

cd /usr/local/hadoop/hadoop2.8/bin

./hdfs namenode -bootstrapStandby

 

6.4.4   Start HDFS / YARN / ZKFailoverController

On master1:

cd /usr/local/hadoop/hadoop2.8/sbin

./start-dfs.sh

./start-yarn.sh

./hadoop-daemon.sh start zkfc

 

On master2, start the standby ResourceManager:

cd /usr/local/hadoop/hadoop2.8/sbin

./yarn-daemon.sh start resourcemanager

 

 

7      Scala installation

Scala version: 2.11; package: scala-2.11.8.tgz

tar -zxvf scala-2.11.8.tgz

mv scala-2.11.8/ scala2.11

mkdir /usr/local/scala

mv scala2.11 /usr/local/scala

 

Copy to the other machines:

cd /usr/local

scp -r scala/ root@master2:/usr/local/

scp -r scala/ root@slave1:/usr/local/

8      Spark installation

Because Hive on Spark will be configured later, Spark must be built from source without Hive. Spark source version: 2.3.0; build command:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided "

 

Package: spark-2.3.0-bin-hadoop2-without-hive.tgz

tar -zxvf spark-2.3.0-bin-hadoop2-without-hive.tgz

mv spark-2.3.0-bin-hadoop2-without-hive spark2.3-without-hive

mkdir /usr/local/spark

mv spark2.3-without-hive /usr/local/spark

cd /usr/local/spark/spark2.3-without-hive/conf

8.1    Configure Spark

8.1.1   spark-env.sh

cp spark-env.sh.template spark-env.sh

vi spark-env.sh

Add the following:
export SCALA_HOME=/usr/local/scala/scala2.11

export JAVA_HOME=/usr/local/java/jdk1.8

export HADOOP_HOME=/usr/local/hadoop/hadoop2.8   

export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop2.8/etc/hadoop 

export SPARK_HOME=/usr/local/spark/spark2.3-without-hive

export SPARK_LAUNCH_WITH_SCALA=0

export SPARK_LIBRARY_PATH=/usr/local/spark/spark2.3-without-hive/lib

export SPARK_WORKER_DIR=/usr/local/spark/spark2.3-without-hive/work

export SPARK_MASTER_WEBUI_PORT=8080

export SPARK_MASTER_PORT=7077

export SPARK_WORKER_PORT=7078

export SPARK_MASTER_IP=master1 

export SPARK_EXECUTOR_MEMORY=4G

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/hadoop2.8/bin/hadoop classpath)

 

8.1.2   slaves

cp slaves.template slaves
Add the worker nodes:
master2
slave1

8.1.3   spark-defaults.conf

cd /usr/local/spark/spark2.3-without-hive/conf/

cp spark-defaults.conf.template spark-defaults.conf

vi spark-defaults.conf

Configuration as follows. Note that spark.eventLog.dir must already exist in HDFS (create it first with `hdfs dfs -mkdir -p /spark-logs`), otherwise Spark will fail to start:

spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://master1:9000/spark-logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
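A missing key in spark-defaults.conf fails silently at submit time, so a quick sanity check helps; the sketch below (function and key list are my own choice) confirms the required keys are defined:

```shell
# Verify that a spark-defaults.conf file defines each given key.
# Prints any missing key and returns non-zero if one is absent.
# Note: dots in key names act as regex wildcards here, which is
# good enough for a sanity check.
conf_has_keys() {
  local file="$1" key missing=0; shift
  for key in "$@"; do
    if ! grep -q "^[[:space:]]*$key[[:space:]]" "$file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return $missing
}

# Example:
# conf_has_keys spark-defaults.conf spark.master spark.eventLog.dir spark.serializer
```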

8.2    Start Spark

cd /usr/local/spark/spark2.3-without-hive/sbin/

./start-all.sh

 

9      MySQL installation

Omitted; see: https://www.cnblogs.com/ljinch/articles/7832307.html

10           Hive installation

wget http://mirrors.shu.edu.cn/apache/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz

tar -zxvf apache-hive-3.0.0-bin.tar.gz

mv apache-hive-3.0.0-bin hive3.0

mkdir /usr/local/hive

mv hive3.0 /usr/local/hive

cd /usr/local/hive/hive3.0/conf

10.1      Configuration

 

10.1.1 hive-env.sh

 

export HADOOP_HOME=/usr/local/hadoop/hadoop2.8

export HIVE_CONF_DIR=/usr/local/hive/hive3.0/conf

export HIVE_AUX_JARS_PATH=/usr/local/hive/hive3.0/lib

 

10.1.2 hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<!-- Hive Execution Parameters -->

<!--jdbc -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://slave1:3306/hive2?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>


<!--<property>-->
<!--<name>spark.yarn.jars</name>-->
<!--<value>hdfs://master1:9000/spark/spark-jars/*</value>-->
<!--</property>-->
<!-- Hive warehouse location in HDFS -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/root/hive/warehouse</value>
</property>

<property>
<name>hive.exec.scratchdir</name>
<value>/root/hive</value>
</property>

<!--spark engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.enable.spark.execution.engine</name>
<value>true</value>
</property>
<property>
<name>spark.home</name>
<value>/usr/local/spark/spark2.3-without-hive</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master1:9000/directory</value>
</property>

<!--sparkcontext -->
<property>
<name>spark.master</name>
<value>spark://master1:7077</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<!-- Tune the following to your environment -->
<property>
<name>spark.executor.instances</name>
<value>3</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>1</value>
</property>
<property>
<name>spark.submit.deployMode</name>
<value>client</value>
</property>

<property>
<name>spark.executor.memory</name>
<value>4g</value>
</property>
<property>
<name>spark.driver.cores</name>
<value>1</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>2048m</value>
</property>
<property>
<name>spark.yarn.queue</name>
<value>default</value>
</property>
<property>
<name>spark.app.name</name>
<value>myInceptor</value>
</property>

<!-- Transactions -->
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
<property>
<name>spark.executor.extraJavaOptions</name>
<value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
</value>
</property>
<!-- Misc -->
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>

<property>
<name>hive.metastore.client.socket.timeout</name>
<value>18000</value>
</property>

<!--webui -->
<property>
<name>hive.server2.webui.host</name>
<value>0.0.0.0</value>
</property>

<property>
<name>hive.server2.webui.port</name>
<value>10002</value>
</property>
</configuration>

 

10.2      Populate Hive's lib directory

cp -r /usr/local/spark/spark2.3-without-hive/jars/* /usr/local/hive/hive3.0/lib/

Also upload the MySQL JDBC driver jar to /usr/local/hive/hive3.0/lib/.

 

10.3      Start Hive and test

Enter Hive with the hive command, then create a database and a table:

create database db_hiveTest;

create table db_hiveTest.student(id int,name string) row format delimited fields terminated by '\t';

 

On the local filesystem, prepare a student.txt file (vim student.txt). The field separator must be a Tab:

1001    zhangsan

1002    lisi

1003    wangwu
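Since a literal Tab is easy to get wrong in an editor, the file can also be generated with printf; a small sketch (printf reuses its format string for each pair of arguments):

```shell
# Generate student.txt with guaranteed Tab separators between id and name.
printf '%s\t%s\n' \
  1001 zhangsan \
  1002 lisi \
  1003 wangwu > student.txt
```

The separators can be verified with `cat -A student.txt`, where tabs show up as ^I.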

After saving, load the file into the table from the Hive shell:

load data local inpath '/usr/local/hive/student.txt' into table db_hivetest.student;

select count(1) from db_hivetest.student;

The count should run as a Spark job and return 3. (Result screenshot omitted.)