HA-Spark Deployment
1. Environment
- Component distribution
| Component / Host | hdp50 | hdp51 | hdp52 |
|---|---|---|---|
| HDFS | DN | NN DN | NN (standby) DN |
| YARN | RM NM | NM | RM (standby) NM HistoryServer [19888] |
| Spark | master (standby) worker | worker | master (active) worker JobHistory [18080] |
| Zookeeper | √ | √ | √ |
- Hadoop: 3.3.0
- Spark: 3.2.0
2. Configuration
- Do the following after extracting the archive and configuring the environment variables (see the sketch below).
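- A minimal sketch of that preparation step, assuming an install path of /opt/module/spark and a profile script name of my choosing; Spark ships only *.template files, so copy them before editing:
# assumed profile script, e.g. /etc/profile.d/spark.sh
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin   # bin only; keeping sbin off the PATH avoids the start-all.sh clash noted in section 3
# create editable copies of the shipped templates
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
cp workers.template workers         # Spark 3.x; the file was named "slaves" in Spark 2.x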
2.1 spark-env.sh
- Set the environment Spark needs either in this config file or in the system environment variables.
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
#################################### Append the following to the end of the file ##############################
# Java / Scala configuration
# if multiple versions exist on the system, the values set here take precedence
export JAVA_HOME=/opt/module/java
export SCALA_HOME=/opt/module/scala
# HADOOP_HOME: also set in the system environment variables; the value here takes precedence
# HADOOP_CONF_DIR: usually not set in the system environment variables; it must be set here, otherwise Spark cannot read the HDFS configuration
export HADOOP_HOME=/opt/module/hadoop
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop
# zookeeper
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hdp50:2181,hdp51:2181,hdp52:2181 -Dspark.deploy.zookeeper.dir=/ha-spark"
# history server: the HDFS directory must be created in advance
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=24 -Dspark.history.fs.logDirectory=hdfs://hacluster/spark-jobhistory"
- Note the history configuration: -Dspark.history.fs.logDirectory=hdfs://hacluster/spark-jobhistory
  - This points to a location on the Hadoop cluster and must be created in advance:
    - hdfs dfs -mkdir /spark-jobhistory
  - HDFS address formats:
    - Non-HA: hdfs://<NameNode IP>:9000/spark-jobhistory
    - HA: hdfs://hacluster/spark-jobhistory
    - hacluster: the name under which the HDFS cluster is registered in ZooKeeper, defined in Hadoop's core-site.xml
  - Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster
    - Check the config entry: export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
    - Workaround: copy (or symlink) core-site.xml and hdfs-site.xml into Spark's conf directory (see the sketch below)
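- A sketch of those two steps using the paths from this section; the Spark install path /opt/module/spark is an assumption, and copying instead of symlinking works equally well:
# create the event-log directory on the HA nameservice
hdfs dfs -mkdir -p hdfs://hacluster/spark-jobhistory
# workaround for UnknownHostException: hacluster — expose the Hadoop client configs to Spark
# (Spark install path assumed to be /opt/module/spark)
ln -s /opt/module/hadoop/etc/hadoop/core-site.xml /opt/module/spark/conf/core-site.xml
ln -s /opt/module/hadoop/etc/hadoop/hdfs-site.xml /opt/module/spark/conf/hdfs-site.xml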
2.2 spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
###################################### when running spark-submit #############################
# HA: two masters, comma-separated
spark.master spark://hdp52:7077,hdp50:7077
# enable Spark event logging (job history)
spark.eventLog.enabled true
# HDFS path for the event logs
spark.eventLog.dir hdfs://hacluster/spark-jobhistory
# address of the Spark history server (linked from the YARN web UI)
spark.yarn.historyServer.address hdp52:18080
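- A quick hedged check that these defaults are picked up once the cluster is running (section 3); the example jar name assumes a stock Spark 3.2.0 build with Scala 2.12, adjust if yours differs:
# spark-submit with no --master uses spark.master from spark-defaults.conf
spark-submit --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0.jar 100
# an event-log file for the run should now exist
hdfs dfs -ls /spark-jobhistory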
2.3 workers (named slaves in Spark 2.x)
hdp50
hdp51
hdp52
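- The conf directory must be identical on all three hosts; a minimal distribution sketch, assuming the same /opt/module/spark path everywhere and that the files were edited on hdp52:
# push the configuration from hdp52 to the other nodes (scp -r works too if rsync is unavailable)
for host in hdp50 hdp51; do
  rsync -av /opt/module/spark/conf/ $host:/opt/module/spark/conf/
done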
2.4 yarn-site.xml
- Configuration for running Spark on YARN
<!-- The test VMs have little memory; add the following so that tasks are not killed unexpectedly -->
<!-- Whether to run a thread that checks each task's physical memory usage and kills it if it exceeds its allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each task's virtual memory usage and kills it if it exceeds its allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hdp52:19888/jobhistory/logs</value>
</property>
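- After editing yarn-site.xml, distribute it to all NodeManagers and restart YARN; a Spark-on-YARN submission then serves as a check. A hedged sketch (example jar as above):
# verify Spark on YARN once YARN has been restarted with the new settings
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0.jar 100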
3. Start / Stop
- On the primary master node (hdp52)
# start the primary master and all workers
# if both Hadoop and Spark are on the PATH, start-all.sh clashes with Hadoop's script, so call it with an explicit path
./sbin/start-all.sh
# history server
./sbin/start-history-server.sh
- On the standby master node (hdp50)
# start the standby master
./sbin/start-master.sh
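- A quick hedged verification of the HA setup: jps should show the daemons, and the master web UI (default port 8080) reports ALIVE on the active master and STANDBY on the other.
# on each node
jps                      # expect Worker everywhere, Master on hdp50/hdp52, HistoryServer on hdp52
# master status in the web UI
#   http://hdp52:8080  -> Status: ALIVE
#   http://hdp50:8080  -> Status: STANDBY
# optional failover test: stop the active master; ZooKeeper promotes the standby after a short delay
./sbin/stop-master.sh    # run on hdp52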