Hadoop 2.8.4 Cluster Installation (Hive on Spark)
2018-08-15 14:38 ljinch
1 Overview
To avoid hitting the same pitfalls in later setups, this guide consolidates the whole process. (JDK installation is omitted here.)
Spark must be compiled from source; a pre-built package is available at: https://download.csdn.net/download/ljinch/10651973
Hive version: 3.0
Hadoop version: 2.8.4
2 Server Information
Hostname    IP
master1     192.168.0.220
master2     192.168.0.221
slave1      192.168.0.222
3 Passwordless SSH Between Servers
3.1 Configure hosts
Run vi /etc/hosts and add the following entries:
192.168.0.220 master1
192.168.0.221 master2
192.168.0.222 slave1
3.2 Generate key pairs
Run the following command on every server:
ssh-keygen -t rsa -P ''
This creates the public and private key files under /root/.ssh.
Exchange the public keys between the servers, appending each one to authorized_keys:
cat id_rsa.pub >> authorized_keys
When finished, the authorized_keys file on every server must contain the public keys of all three servers.
This completes the passwordless SSH setup.
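Appending keys by hand is error-prone; if ssh-copy-id is available, the exchange can be scripted. A minimal sketch, run once from each node (gen_copy_cmds is our own helper name; hostnames come from the /etc/hosts entries above):

```shell
# Print one ssh-copy-id command per cluster node; pipe to `sh` to run them.
gen_copy_cmds() {
  for host in master1 master2 slave1; do
    echo "ssh-copy-id -i /root/.ssh/id_rsa.pub root@$host"
  done
}

gen_copy_cmds          # review the commands first
# gen_copy_cmds | sh   # then execute them (prompts once per host)
```

Running the piped form prompts for each host's root password once and appends the key to that host's authorized_keys.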
4 Environment Variables
All software is installed under /usr/local by default. Configure the environment variables for ZooKeeper, Java, Scala, Hive, Spark, etc. up front:
vi /etc/profile
#Java Config
export JAVA_HOME=/usr/local/java/jdk1.8
export JRE_HOME=/usr/local/java/jdk1.8/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
#maven
export MAVEN_HOME=/usr/local/maven/maven3.5
# Zookeeper Config
export ZK_HOME=/usr/local/zookeeper
# HBase Config
export HBASE_HOME=/usr/local/hbase/hbase1.3
# Hadoop Config
export HADOOP_HOME=/usr/local/hadoop/hadoop2.8
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#hive
export HIVE_HOME=/usr/local/hive/hive3.0
export HIVE_CONF_DIR=/usr/local/hive/hive3.0/conf
#scala
export SCALA_HOME=/usr/local/scala/scala2.11
#spark
export SPARK_HOME=/usr/local/spark/spark2.3-without-hive
export PATH=.:${JAVA_HOME}/bin:${MAVEN_HOME}/bin:${SPARK_HOME}/bin:${HADOOP_HOME}/bin:${SCALA_HOME}/bin:${HIVE_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${HBASE_HOME}/bin:$PATH
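After sourcing /etc/profile, it is worth verifying that every *_HOME variable points at an existing directory; a typo here only surfaces much later as a confusing startup failure. A small sketch (check_homes is our own helper, not part of any tool):

```shell
# Report any configured home directory that does not exist.
# Returns non-zero if at least one is missing.
check_homes() {
  rc=0
  for d in "$@"; do
    if [ ! -d "$d" ]; then
      echo "missing: $d"
      rc=1
    fi
  done
  return $rc
}

# . /etc/profile
# check_homes "$JAVA_HOME" "$HADOOP_HOME" "$HIVE_HOME" "$SPARK_HOME" "$SCALA_HOME" "$ZK_HOME"
```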
5 ZooKeeper Cluster Setup
Download: http://mirrors.cnnic.cn/apache/zookeeper
Version used here: zookeeper-3.4.13.tar.gz
tar -zxvf zookeeper-3.4.13.tar.gz
mv zookeeper-3.4.13 zookeeper
mv zookeeper /usr/local/
5.1 Edit the ZooKeeper configuration
Open the configuration directory:
cd /usr/local/zookeeper/conf
mv zoo_sample.cfg zoo.cfg
vi zoo.cfg
Edit the file as follows.
Comment out: #dataDir=/tmp/zookeeper
Add the following (this example is from master2, hence server.2 is 0.0.0.0):
dataDir=/usr/local/zookeeper/data
server.1=192.168.0.220:2888:3888
server.2=0.0.0.0:2888:3888
server.3=192.168.0.222:2888:3888
Apply the same change on all three servers. Note: on each machine, replace its own IP with 0.0.0.0.
5.2 Create the myid file
Create the data directory and the myid file:
mkdir /usr/local/zookeeper/data
cd /usr/local/zookeeper/data
vi myid
Write 1, 2, and 3 into myid on master1, master2, and slave1 respectively.
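Writing a different number on each machine by hand is easy to get wrong; the id can instead be derived from the hostname. A sketch (zk_id_for is a hypothetical helper; the ids must match the server.N lines in zoo.cfg):

```shell
# Map a hostname to its ZooKeeper server id (must match server.N in zoo.cfg).
zk_id_for() {
  case "$1" in
    master1) echo 1 ;;
    master2) echo 2 ;;
    slave1)  echo 3 ;;
    *) return 1 ;;
  esac
}

# mkdir -p /usr/local/zookeeper/data
# zk_id_for "$(hostname)" > /usr/local/zookeeper/data/myid
```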
5.3 Start the service
Run the start command on all three machines:
cd /usr/local/zookeeper
./bin/zkServer.sh start
After a successful start, check the status with:
./bin/zkServer.sh status
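zkServer.sh status prints several lines; the interesting one is "Mode:", which shows whether the node is the leader or a follower. A tiny helper to extract it (zk_mode is our own convenience function, not part of ZooKeeper):

```shell
# Extract the value of the "Mode:" line from `zkServer.sh status` output.
zk_mode() { sed -n 's/^Mode: //p'; }

# Example: ./bin/zkServer.sh status 2>/dev/null | zk_mode
# One node should report "leader", the other two "follower".
```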
6 Hadoop Installation
Version used: hadoop-2.8.4.tar.gz
tar -zxvf hadoop-2.8.4.tar.gz
mkdir /usr/local/hadoop
mv hadoop-2.8.4 hadoop2.8
mv hadoop2.8 /usr/local/hadoop
cd /usr/local/hadoop/hadoop2.8/etc/hadoop
6.1 Configuration
6.1.1 vi hadoop-env.sh
Set the JDK installation path (the JDK is assumed to be installed at /usr/local/java/jdk1.8):
JAVA_HOME=/usr/local/java/jdk1.8
6.1.2 vi core-site.xml
<configuration>
<!-- HDFS address; in HA mode this points to the nameservice -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<!-- Default parent directory for data stored by the NameNode, DataNode, JournalNode, etc.; each can also be set individually -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop2.8/tmp</value>
</property>
<!-- ZooKeeper quorum addresses and ports. The number of nodes must be odd, and at least three -->
<property>
<name>ha.zookeeper.quorum</name>
<value>master1:2181,master2:2181,slave1:2181</value>
</property>
</configuration>
6.1.3 vi hdfs-site.xml
<configuration>
<!-- Replication factor; must not exceed the number of DataNodes -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- Logical service name for the NameNode cluster -->
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<!-- NameNodes belonging to this nameservice, one id per NameNode -->
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>master1,master2</value>
</property>
<!-- RPC address and port of the NameNode named master1; RPC is used to talk to the DataNodes -->
<property>
<name>dfs.namenode.rpc-address.ns1.master1</name>
<value>master1:9000</value>
</property>
<!-- RPC address and port of the NameNode named master2 -->
<property>
<name>dfs.namenode.rpc-address.ns1.master2</name>
<value>master2:9000</value>
</property>
<!-- HTTP address and port of the NameNode named master1, used by web clients -->
<property>
<name>dfs.namenode.http-address.ns1.master1</name>
<value>master1:50070</value>
</property>
<!-- HTTP address and port of the NameNode named master2, used by web clients -->
<property>
<name>dfs.namenode.http-address.ns1.master2</name>
<value>master2:50070</value>
</property>
<!-- JournalNodes that share the edit log between the NameNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://master1:8485;master2:8485;slave1:8485/ns1</value>
</property>
<!-- Automatically fail over to the other NameNode when the active one fails -->
<property>
<name>dfs.ha.automatic-failover.enabled.ns1</name>
<value>true</value>
</property>
<!-- Directory where the JournalNode stores its edit logs -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/hadoop2.8/tmp/data/dfs/journalnode</value>
</property>
<!-- Proxy class clients use to find the active NameNode -->
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fence the old active NameNode over SSH during failover -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- Private key used for SSH fencing -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- SSH connect timeout (ms) -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
</configuration>
6.1.4 vi mapred-site.xml
<configuration>
<!-- Use YARN as the MapReduce resource scheduling framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
6.1.5 vi yarn-site.xml
<configuration>
<!-- Enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Logical id for the ResourceManager cluster -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yrc</value>
</property>
<!-- Two ResourceManagers are used; list their logical ids -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- Host of rm1 -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master1</value>
</property>
<!-- Host of rm2 -->
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>master2</value>
</property>
<!-- This machine (master1) acts as rm1 -->
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm1</value>
</property>
<!-- ZooKeeper quorum -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>master1:2181,master2:2181,slave1:2181</value>
</property>
<!-- Auxiliary service run by the NodeManager; mapreduce_shuffle is required for MapReduce jobs -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Scheduler -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- Resource settings -->
<!-- Minimum vcores per container request -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum vcores per container request -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Vcores available for containers on this node -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8096</value>
</property>
<!-- Disable the physical/virtual memory checks on containers -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
6.1.6 vi slaves
master1
master2
slave1
6.2 Copy Hadoop to the other machines
scp -r /usr/local/hadoop/hadoop2.8/etc/hadoop/* root@slave1:/usr/local/hadoop/hadoop2.8/etc/hadoop/
scp -r /usr/local/hadoop/hadoop2.8/etc/hadoop/* root@master2:/usr/local/hadoop/hadoop2.8/etc/hadoop/
6.3 Adjust yarn-site.xml on the other two machines
On the standby master node master2 (the standby ResourceManager), change the following property so that this machine acts as rm2:
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm2</value>
</property>
Also delete this property from slave1, since slave1 does not run a ResourceManager.
6.4 Start Hadoop
6.4.1 Start the JournalNodes
Start a JournalNode on each node:
cd /usr/local/hadoop/hadoop2.8/sbin
./hadoop-daemon.sh start journalnode
6.4.2 Format the NameNode and ZKFC
On master1, run:
cd /usr/local/hadoop/hadoop2.8/bin
./hdfs namenode -format
./hdfs zkfc -formatZK
6.4.3 Sync the standby NameNode with the active one
On master2, bootstrap the standby:
cd /usr/local/hadoop/hadoop2.8/bin
./hdfs namenode -bootstrapStandby
6.4.4 Start HDFS, YARN, and the ZKFailoverController
On master1:
cd /usr/local/hadoop/hadoop2.8/sbin
./start-dfs.sh
./start-yarn.sh
./hadoop-daemon.sh start zkfc
On master2, start the standby ResourceManager:
cd /usr/local/hadoop/hadoop2.8/sbin
./yarn-daemon.sh start resourcemanager
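A quick sanity check after startup is to run jps on each node and compare against what the configuration above implies: NameNode, ZKFC, and ResourceManager on the two masters; DataNode, NodeManager, JournalNode, and QuorumPeerMain everywhere. The helper below (expected_jps, our own summary of this particular configuration, not a Hadoop tool) encodes that expectation:

```shell
# Processes expected in `jps` output on each node, per the configuration above.
expected_jps() {
  common="DataNode NodeManager JournalNode QuorumPeerMain"
  case "$1" in
    master1|master2) echo "NameNode DFSZKFailoverController ResourceManager $common" ;;
    slave1)          echo "$common" ;;
    *) return 1 ;;
  esac
}

# for h in master1 master2 slave1; do
#   echo "== $h: expect $(expected_jps "$h")"; ssh root@"$h" jps
# done
```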
7 Scala Installation
Scala version: 2.11.8; package: scala-2.11.8.tgz
tar -zxvf scala-2.11.8.tgz
mv scala-2.11.8/ scala2.11
mkdir /usr/local/scala
mv scala2.11 /usr/local/scala
Copy it to the other machines:
cd /usr/local
scp -r scala/ root@master2:/usr/local/
scp -r scala/ root@slave1:/usr/local/
8 Spark Installation
Since Hive on Spark will be configured later, Spark must be built from source (version 2.3.0) without Hive support. Build command:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
Resulting package: spark-2.3.0-bin-hadoop2-without-hive.tgz
tar -zxvf spark-2.3.0-bin-hadoop2-without-hive.tgz
mv spark-2.3.0-bin-hadoop2-without-hive spark2.3-without-hive
mkdir /usr/local/spark
mv spark2.3-without-hive /usr/local/spark
cd /usr/local/spark/spark2.3-without-hive/conf
8.1 Configure Spark
8.1.1 spark-env.sh
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following:
export SCALA_HOME=/usr/local/scala/scala2.11
export JAVA_HOME=/usr/local/java/jdk1.8
export HADOOP_HOME=/usr/local/hadoop/hadoop2.8
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop2.8/etc/hadoop
export SPARK_HOME=/usr/local/spark/spark2.3-without-hive
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=/usr/local/spark/spark2.3-without-hive/lib
export SPARK_WORKER_DIR=/usr/local/spark/spark2.3-without-hive/work
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_MASTER_IP=master1
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/hadoop2.8/bin/hadoop classpath)
8.1.2 slaves
cp slaves.template slaves
Add the worker nodes:
master2
slave1
8.1.3 spark-defaults.conf
cd /usr/local/spark/spark2.3-without-hive/conf/
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
Add the following settings. Note that the event log directory (hdfs://master1:9000/spark-logs below) must already exist in HDFS; create it first with hdfs dfs -mkdir -p /spark-logs.
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master1:9000/spark-logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 2g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
8.2 Start Spark
cd /usr/local/spark/spark2.3-without-hive/sbin/
./start-all.sh
9 MySQL Installation
Omitted here; see: https://www.cnblogs.com/ljinch/articles/7832307.html
10 Hive Installation
wget http://mirrors.shu.edu.cn/apache/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz
tar -zxvf apache-hive-3.0.0-bin.tar.gz
mv apache-hive-3.0.0-bin hive3.0
mkdir /usr/local/hive
mv hive3.0 /usr/local/hive
cd /usr/local/hive/hive3.0/conf
10.1 Configuration
10.1.1 hive-env.sh
export HADOOP_HOME=/usr/local/hadoop/hadoop2.8
export HIVE_CONF_DIR=/usr/local/hive/hive3.0/conf
export HIVE_AUX_JARS_PATH=/usr/local/hive/hive3.0/lib
10.1.2 hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<!-- Hive Execution Parameters -->
<!--jdbc -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://slave1:3306/hive2?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
<!--<property>-->
<!--<name>spark.yarn.jars</name>-->
<!--<value>hdfs://master1:9000/spark/spark-jars/*</value>-->
<!--</property>-->
<!-- Hive warehouse location in HDFS -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/root/hive/warehouse</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/root/hive</value>
</property>
<!-- Spark execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.enable.spark.execution.engine</name>
<value>true</value>
</property>
<property>
<name>spark.home</name>
<value>/usr/local/spark/spark2.3-without-hive</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master1:9000/directory</value>
</property>
<!-- SparkContext settings -->
<property>
<name>spark.master</name>
<value>spark://master1:7077</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<!-- Tune the following to your environment -->
<property>
<name>spark.executor.instances</name>
<value>3</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>1</value>
</property>
<property>
<name>spark.submit.deployMode</name>
<value>client</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>4g</value>
</property>
<property>
<name>spark.driver.cores</name>
<value>1</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>2048m</value>
</property>
<property>
<name>spark.yarn.queue</name>
<value>default</value>
</property>
<property>
<name>spark.app.name</name>
<value>myInceptor</value>
</property>
<!-- Transaction settings -->
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
<property>
<name>spark.executor.extraJavaOptions</name>
<value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
</value>
</property>
<!-- Miscellaneous -->
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>18000</value>
</property>
<!-- Web UI -->
<property>
<name>hive.server2.webui.host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>hive.server2.webui.port</name>
<value>10002</value>
</property>
</configuration>
10.2 Populate Hive's lib directory
cp -r /usr/local/spark/spark2.3-without-hive/jars/* /usr/local/hive/hive3.0/lib/
Also upload the MySQL JDBC driver jar into /usr/local/hive/hive3.0/lib/.
10.3 Start Hive and run a test
Enter the Hive CLI with the hive command, then create a database and a table:
create database db_hiveTest;
create table db_hiveTest.student(id int,name string) row format delimited fields terminated by '\t';
On the local filesystem, prepare a student.txt file (vim student.txt); separate the fields with the Tab key:
1001 zhangsan
1002 lisi
1003 wangwu
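Tab characters are easily lost when copy-pasting from a browser; the file can also be generated with printf so the separators are guaranteed. A sketch (the output path OUT is adjustable; the load command below expects /usr/local/hive/student.txt):

```shell
# Write the sample rows with literal tab separators.
# printf recycles its format string, producing one line per id/name pair.
OUT=${OUT:-student.txt}
printf '%s\t%s\n' \
  1001 zhangsan \
  1002 lisi \
  1003 wangwu > "$OUT"
```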
After saving, load the file into the table from the Hive shell and verify:
load data local inpath '/usr/local/hive/student.txt' into table db_hivetest.student;
select count(1) from db_hivetest.student;
The count query should return 3.
