|NO.Z.00077|——————————|BigDataEnd|——|Hadoop&Spark.V03|——|Spark.v03|Spark Principles & Source Code|Master/Worker Analysis & Master Startup Flow|
一、Master Startup Flow
### --- Master startup flow
~~~ Master is an RpcEndpoint: it implements the RpcEndpoint interface.
~~~ Master's lifecycle follows the steps constructor -> onStart -> receive* -> onStop.
~~~ The most important work in Master's onStart method is performing recovery, as sketched below.
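~~~ The lifecycle can be pictured with a minimal sketch. This is a toy stand-in, not Spark's
~~~ actual API (org.apache.spark.rpc.RpcEndpoint is private[spark]); the trait and demo names
~~~ below are invented for illustration only:

trait ToyRpcEndpoint {
  def onStart(): Unit = {}                       // called once, after the constructor
  def receive: PartialFunction[Any, Unit]        // called for each incoming message
  def onStop(): Unit = {}                        // called once, when the endpoint is stopped
}

object LifecycleDemo extends App {
  val endpoint = new ToyRpcEndpoint {            // 1. constructor
    override def onStart(): Unit = println("onStart: the Master kicks off recovery here")
    override def receive: PartialFunction[Any, Unit] = { case m => println(s"receive: $m") }
    override def onStop(): Unit = println("onStop: release resources")
  }
  endpoint.onStart()                             // 2. onStart
  Seq("ElectedLeader", "CheckForWorkerTimeOut").foreach(endpoint.receive)  // 3. receive*
  endpoint.onStop()                              // 4. onStop
}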
### --- Master HA implementation modes (configuration sketch below):
~~~ ZOOKEEPER: ZooKeeper-based Active/Standby mode, suitable for production.
~~~ The basic principle is that ZooKeeper elects one Master as leader while the remaining Masters stay in Standby state.
~~~ FILESYSTEM: filesystem-based single-point recovery, mainly for development or test environments.
~~~ A directory is provided for Spark to save the registration info of Spark Applications and Workers;
~~~ if the Master fails, restarting the Master process (start-master.sh) restores the registration info of the running Spark Applications and Workers.
~~~ CUSTOM: lets users plug in their own HA implementation, which is particularly useful for advanced users.
~~~ NONE (the `_` case in the source, the default): with no HA configured, cluster state is not persisted, and the Master manages the cluster immediately on startup.
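~~~ As a hedged sketch, the HA modes map onto the spark.deploy.* keys that the source below
~~~ reads. Setting them on a SparkConf here is only for illustration; on a real cluster they
~~~ are normally passed via SPARK_DAEMON_JAVA_OPTS or spark-defaults.conf before the Master
~~~ starts, and the hostnames, paths, and factory class name are hypothetical:

import org.apache.spark.SparkConf

val zkConf = new SparkConf()
  .set("spark.deploy.recoveryMode", "ZOOKEEPER")
  .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181") // hypothetical quorum
  .set("spark.deploy.zookeeper.dir", "/spark")                     // znode for Master state

val fsConf = new SparkConf()
  .set("spark.deploy.recoveryMode", "FILESYSTEM")
  .set("spark.deploy.recoveryDirectory", "/shared/spark-recovery") // hypothetical NFS path

val customConf = new SparkConf()
  .set("spark.deploy.recoveryMode", "CUSTOM")
  .set("spark.deploy.recoveryMode.factory", "com.example.MyRecoveryModeFactory") // hypothetical factory class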
二、Source Code Extracts
### --- Source extract: Master.scala
~~~ # Source extract: Master.scala
~~~ # lines 18-111
package org.apache.spark.deploy.master
private[deploy] class Master(
override val rpcEnv: RpcEnv,
address: RpcAddress,
webUiPort: Int,
val securityMgr: SecurityManager,
val conf: SparkConf)
extends ThreadSafeRpcEndpoint with Logging with LeaderElectable {
private val forwardMessageThread =
ThreadUtils.newDaemonSingleThreadScheduledExecutor("master-forward-message-thread")
// Add some Spark-derived settings to the Hadoop configuration
private val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
// For application IDs: date format used when generating application IDs
private def createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
private val WORKER_TIMEOUT_MS = conf.getLong("spark.worker.timeout", 60) * 1000
private val RETAINED_APPLICATIONS = conf.getInt("spark.deploy.retainedApplications", 200)
private val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
private val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
// To use Spark's own leader election and failure recovery, set spark.deploy.recoveryMode.
// Two built-in modes: ZOOKEEPER and FILESYSTEM.
// ZOOKEEPER: requires spark.deploy.zookeeper.url ==> the address of the ZooKeeper cluster
//            and spark.deploy.zookeeper.dir ==> the znode path where Spark registers on the ZooKeeper cluster
// FILESYSTEM: filesystem failure-recovery mode
//             spark.deploy.recoveryMode ==> FILESYSTEM
//             spark.deploy.recoveryDirectory ==> directory storing the Spark Master's state for failure recovery;
//             combined with a shared filesystem such as NFS, another Master process can take over when the current Master process dies.
private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
private val MAX_EXECUTOR_RETRIES = conf.getInt("spark.deploy.maxExecutorRetries", 10)
val workers = new HashSet[WorkerInfo]
val idToApp = new HashMap[String, ApplicationInfo]
private val waitingApps = new ArrayBuffer[ApplicationInfo]
val apps = new HashSet[ApplicationInfo]
private val idToWorker = new HashMap[String, WorkerInfo]
private val addressToWorker = new HashMap[RpcAddress, WorkerInfo]
private val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo]
private val addressToApp = new HashMap[RpcAddress, ApplicationInfo]
private val completedApps = new ArrayBuffer[ApplicationInfo]
private var nextAppNumber = 0
private val drivers = new HashSet[DriverInfo]
private val completedDrivers = new ArrayBuffer[DriverInfo]
// Drivers currently spooled for scheduling
private val waitingDrivers = new ArrayBuffer[DriverInfo]
private var nextDriverNumber = 0
Utils.checkHost(address.host)
/*
Spark metrics system: MetricsSystem
A MetricsSystem is created by a given instance and is composed of sources and sinks: it periodically
polls metrics from the sources and pushes them to the sinks. The three concepts:
Instance: who is using the metrics system. Spark has roles such as master, worker, executor and
  client driver, each of which creates a metrics system to monitor Spark's state. Roles implemented
  so far: master, worker, executor, driver, applications.
Source: where the metrics are collected from. There are two kinds:
  (1) Spark-internal sources, e.g. MasterSource and WorkerSource, which collect the internal state of Spark components
  (2) common sources, e.g. JvmSource, configured via the configuration file
Sink: where the metrics are written to.
(A toy model of this Instance/Source/Sink pipeline appears after this extract.)
*/
private val masterMetricsSystem = MetricsSystem.createMetricsSystem("master", conf, securityMgr)
// Metrics system for applications
private val applicationMetricsSystem = MetricsSystem.createMetricsSystem("applications", conf, securityMgr)
private val masterSource = new MasterSource(this)
// After onStart, webUi will be set
private var webUi: MasterWebUI = null
private val masterPublicAddress = {
val envVar = conf.getenv("SPARK_PUBLIC_DNS")
if (envVar != null) envVar else address.host
}
private val masterUrl = address.toSparkURL
private var masterWebUiUrl: String = _
private var state = RecoveryState.STANDBY
// Persistence engine: persists state when a Master/Worker is newly created or removed, for failure recovery
private var persistenceEngine: PersistenceEngine = _
// Main interface (trait) of the leader-election agent
private var leaderElectionAgent: LeaderElectionAgent = _
private var recoveryCompletionTask: ScheduledFuture[_] = _
private var checkForWorkerTimeOutTask: ScheduledFuture[_] = _
// As a temporary workaround before better ways of configuring memory, we allow users to set
// a flag that will perform round-robin scheduling across the nodes (spreading out each app
// among all the nodes) instead of trying to consolidate each app onto a small # of nodes.
private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
// Default maxCores for applications that don't specify it (i.e. pass Int.MaxValue)
// Default number of cores (spark.deploy.defaultCores) for a standalone deployment
private val defaultCores = conf.getInt("spark.deploy.defaultCores", Int.MaxValue)
val reverseProxy = conf.getBoolean("spark.ui.reverseProxy", false)
if (defaultCores < 1) {
throw new SparkException("spark.deploy.defaultCores must be positive")
}
// Alternative application submission gateway that is stable across Spark versions
private val restServerEnabled = conf.getBoolean("spark.master.rest.enabled", false)
private var restServer: Option[StandaloneRestServer] = None
private var restServerBoundPort: Option[Int] = None
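~~~ The Instance/Source/Sink pipeline described in the MetricsSystem comment above can be
~~~ modeled with a small sketch (toy types invented here, not Spark's MetricsSystem API):

trait Source { def sourceName: String; def poll(): Map[String, Long] }
trait Sink { def report(metrics: Map[String, Long]): Unit }

// A metrics system owned by one instance (e.g. "master"): each cycle, poll every source,
// qualify the metric names with "<instance>.<source>.<key>", and push the snapshot to every sink.
class ToyMetricsSystem(instance: String, sources: Seq[Source], sinks: Seq[Sink]) {
  def tick(): Unit = {
    val snapshot: Map[String, Long] = sources.flatMap { s =>
      s.poll().map { case (k, v) => (s"$instance.${s.sourceName}.$k", v) }
    }.toMap
    sinks.foreach(_.report(snapshot))
  }
}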
~~~ # Source extract: Master.scala
~~~ # lines 112-214
{
val authKey = SecurityManager.SPARK_AUTH_SECRET_CONF
require(conf.getOption(authKey).isEmpty || !restServerEnabled,
s"The RestSubmissionServer does not support authentication via ${authKey}. Either turn " +
"off the RestSubmissionServer with spark.master.rest.enabled=false, or do not use " +
"authentication.")
}
override def onStart(): Unit = {
logInfo("Starting Spark master at " + masterUrl)
logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
/** Create the Master web UI */
webUi = new MasterWebUI(this, webUiPort)
webUi.bind()
masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort
if (reverseProxy) {
masterWebUiUrl = conf.get("spark.ui.reverseProxyUrl", masterWebUiUrl)
webUi.addProxy()
logInfo(s"Spark Master is acting as a reverse proxy. Master, Workers and " +
s"Applications UIs are available at $masterWebUiUrl")
}
/** Scheduled task: every WORKER_TIMEOUT_MS (60 seconds by default), check whether any worker has timed out */
checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
// self is this endpoint's RpcEndpointRef; periodically send CheckForWorkerTimeOut to the masterEndpoint,
// which checks for and removes workers that have not sent a heartbeat within the timeout
self.send(CheckForWorkerTimeOut)
}
}, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
// false by default (see spark.master.rest.enabled above)
if (restServerEnabled) {
val port = conf.getInt("spark.master.rest.port", 6066)
restServer = Some(new StandaloneRestServer(address.host, port, conf, self, masterUrl))
}
restServerBoundPort = restServer.map(_.start())
// Register the master's metrics source; metric names follow the pattern <app ID>.<executor ID (or "driver")>.<source name>
masterMetricsSystem.registerSource(masterSource)
// Start the master's MetricsSystem; it can only be started once
masterMetricsSystem.start()
// Start the applications metrics system
applicationMetricsSystem.start()
// Attach the master and app metrics servlet handler to the web ui after the metrics systems are
// started.
masterMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)
applicationMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)
// Everything above starts the metrics systems for the master and the applications
// Java serialization framework
val serializer = new JavaSerializer(conf)
// Persistence engine: persists state when a Master/Worker is newly created or removed, for failure recovery
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
case "ZOOKEEPER" =>
logInfo("Persisting recovery state to ZooKeeper")
val zkFactory =
new ZooKeeperRecoveryModeFactory(conf, serializer)
(zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
case "FILESYSTEM" =>
val fsFactory =
new FileSystemRecoveryModeFactory(conf, serializer)
(fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
case "CUSTOM" =>
val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
.newInstance(conf, serializer)
.asInstanceOf[StandaloneRecoveryModeFactory]
(factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
case _ =>
(new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}
persistenceEngine = persistenceEngine_
// Failover between masters is driven by the Curator framework; the leaderElectionAgent is implemented in the ZooKeeperLeaderElectionAgent class
leaderElectionAgent = leaderElectionAgent_
}
override def onStop() {
... ...
}
override def electedLeader() {
... ...
}
override def revokedLeadership() {
... ...
}
/**
* Master message handling.
* After receiving ElectedLeader, the Master runs the election/recovery check. In local-cluster mode
* there is only one Master, so the persistenceEngine has no persisted App, Driver, or Worker state,
* and the current Master immediately becomes the active (ALIVE) one. This is where Master failure
* recovery comes in.
*/
override def receive: PartialFunction[Any, Unit] = {
case ElectedLeader => ...
case CompleteRecovery => completeRecovery()
case RevokedLeadership =>
logError("Leadership has been revoked -- master shutting down.")
System.exit(0)
~~~ # Source extract: Master.scala
~~~ # lines 215-302
/** Wait for workers to register.
* Steps the Master takes on receiving RegisterWorker:
* 1. Create a WorkerInfo.
* 2. Register the WorkerInfo.
* 3. Send a RegisteredWorker message back to the Worker to confirm registration.
* 4. Call schedule() to perform resource scheduling.
*/
case RegisterWorker(
id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
workerHost, workerPort, cores, Utils.megabytesToString(memory)))
// If this Master is not active (STANDBY, i.e. still recovering), reply MasterInStandby and do nothing else
// If this worker id was already registered, reply to the sender with an error
if (state == RecoveryState.STANDBY) {
workerRef.send(MasterInStandby)
// workerRef.send()==>RpcEndpointRef.send()===>NettyRpcEnv.send()
} else if (idToWorker.contains(id)) {
workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
// duplicate worker id
// Master is not recovering and the id is new: proceed with registration
} else {
val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
workerRef, workerWebUiUrl)
/** registerWorker(worker) registers the WorkerInfo, which simply adds it to workers: HashSet[WorkerInfo]
* and updates the worker id -> worker and worker address -> worker mappings
*/
if (registerWorker(worker)) {
// If the worker registered successfully, persist its info for later recovery
persistenceEngine.addWorker(worker)
// Handled on the worker side by Worker.handleRegisterResponse(msg: RegisterWorkerResponse)
workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
schedule()
} else {
// If worker registration failed, send a message back to the sender
val workerAddress = worker.endpoint.address
logWarning("Worker registration failed. Attempted to re-register worker at same " +
"address: " + workerAddress)
workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
+ workerAddress))
}
}
// Register an application (App)
// If standby, ignore the message; otherwise register the application, reply with the result, and start scheduling
case RegisterApplication(description, driver) =>
// TODO Prevent repeated registrations from some driver
// (repeated registrations from the same driver are not yet prevented; see the TODO above)
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
// A standby Master ignores registration messages. If an AppClient sends a request to a standby
// Master, the request times out (20 seconds by default) and the client retries.
} else {
logInfo("Registering app " + description.name)
// Create the App: creation time, app ID, description, the driver ref, and the default core count (defaultCores)
val app = createApplication(description, driver)
// Save the Application's info, mainly by adding it to the waitingApps list for later scheduling
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
// Persist this App's info (to ZooKeeper under the ZOOKEEPER recovery mode)
persistenceEngine.addApplication(app)
driver.send(RegisteredApplication(app.id, self))
// Schedule drivers and apps
schedule()
}
case ExecutorStateChanged(appId, execId, state, message, exitStatus) => ... ...
case DriverStateChanged(driverId, state, exception) => ... ...
case Heartbeat(workerId, worker) =>
idToWorker.get(workerId) match {
// A heartbeat arrived from this worker
case Some(workerInfo) =>
// Record the worker's last heartbeat time (a sketch of the timeout check follows this extract)
workerInfo.lastHeartbeat = System.currentTimeMillis()
case None =>
if (workers.map(_.id).contains(workerId)) {
logWarning(s"Got heartbeat from unregistered worker $workerId." +
" Asking it to re-register.")
worker.send(ReconnectWorker(masterUrl))
} else {
logWarning(s"Got heartbeat from unregistered worker $workerId." +
" This worker was never registered, so ignoring the heartbeat.")
}
}
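~~~ The heartbeat bookkeeping above feeds the CheckForWorkerTimeOut task scheduled in onStart:
~~~ any worker whose lastHeartbeat is older than WORKER_TIMEOUT_MS is considered dead. A
~~~ simplified sketch of that check (toy types invented here, not Spark's timeOutDeadWorkers):

case class ToyWorker(id: String, var lastHeartbeat: Long)

def timedOutWorkers(workers: Seq[ToyWorker], timeoutMs: Long): Seq[ToyWorker] = {
  val deadline = System.currentTimeMillis() - timeoutMs
  workers.filter(_.lastHeartbeat < deadline)  // missed heartbeats for longer than the timeout
}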
~~~ # Source extract: Master.scala
~~~ # lines 303-407
// Set the state of the app with this appId to WAITING, in preparation for the Master switchover.
case MasterChangeAcknowledged(appId) => ...
// Response carrying the worker's scheduler state:
// look up the worker by workerId and set its state to ALIVE, find the executors whose app id is
// still defined, attach those executors to their apps, and record those apps on the worker.
// This can be understood as the Master-side recovery of a Worker.
case WorkerSchedulerStateResponse(workerId, executors, driverIds) => ... ...
// The worker's latest state
case WorkerLatestState(workerId, executors, driverIds) => ... ...
// Unregister the application
case UnregisterApplication(applicationId) => ... ...
case CheckForWorkerTimeOut =>
// Check for and remove workers that timed out without sending a heartbeat
timeOutDeadWorkers()
}
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
case RequestSubmitDriver(description) => ... ...
// Request to kill a driver
case RequestKillDriver(driverId) => ... ...
// Request driver status: look up the requested driver and, if found, return its status
case RequestDriverStatus(driverId) => ... ...
// Request the master's state; reply to the sender with the master's state
case RequestMasterState => ... ...
}
// A remote endpoint disconnected from the master
override def onDisconnected(address: RpcAddress): Unit = ... ...
/**
* Schedule the currently available resources among waiting apps. This method will be called
* every time a new app joins or resource availability changes.
*
* Two strategies for choosing Workers (Executors) for an Application:
* 1. Spread out: distribute an Application across as many nodes as possible; enabled via
*    spark.deploy.spreadOut, which defaults to true.
* 2. Consolidate: pack an Application onto as few nodes as possible; well suited to
*    CPU-intensive Applications with comparatively low memory usage.
*/
private def schedule(): Unit = {
if (state != RecoveryState.ALIVE) {
return
}
// Collect the workers in ALIVE state
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
for (driver <- waitingDrivers.toList) {
// Under standalone client deploy mode, the driver runs on the node that submitted it;
// under cluster deploy mode, the Master schedules which node runs the driver.
// iterate over a copy of waitingDrivers
// We assign workers to each waiting driver in a round-robin fashion. For each driver, we
// start from the last worker that was assigned a driver, and continue onwards until we have
// explored all alive workers.
var launched = false
var numWorkersVisited = 0
// Loop over each waiting driver; for example:
// A submitted a program (needs 100 MB of memory, 1 core)
// B submitted a program (needs 200 MB of memory, 1 core)
// C submitted a program (needs 300 MB of memory, 2 cores)
// All three registered successfully but are waiting because resources are insufficient.
// Loop over the workers: whenever a worker is ALIVE and has spare capacity, try to launch on it
// (see the standalone sketch after schedule()).
while (numWorkersVisited < numWorkersAlive && !launched) {
val worker = shuffledAliveWorkers(curPos)
numWorkersVisited += 1
// If the worker's free memory >= the driver's requested memory and its free cores >= the requested cores:
// suppose the worker has 250 MB free and 1 free core; then only A and B qualify. First come, first
// served: A runs on this worker, leaving 150 MB and 0 cores free, so B and C keep waiting.
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
/**
* Launch the driver on this worker
*/
launchDriver(worker, driver)
waitingDrivers -= driver
launched = true
}
curPos = (curPos + 1) % numWorkersAlive
}
}
/**
* After the loop above has gone over the waiting drivers, launch executors,
* i.e. allocate resources to the waiting apps
*/
startExecutorsOnWorkers()
}
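~~~ The driver-placement loop above can be reproduced with a standalone sketch (simplified
~~~ types invented here, not Spark's code): walk the shuffled alive workers round-robin,
~~~ starting where the previous driver's search left off, and place each driver on the first
~~~ worker with enough free memory and cores:

case class Res(var memFree: Int, var coresFree: Int)       // a worker's free resources (MB, cores)
case class Need(mem: Int, cores: Int)                      // a waiting driver's requirements

def assignRoundRobin(workers: IndexedSeq[Res], drivers: Seq[Need]): Seq[Option[Int]] = {
  var pos = 0
  drivers.map { d =>
    var visited = 0
    var placed: Option[Int] = None
    while (visited < workers.size && placed.isEmpty) {
      val w = workers(pos)
      if (w.memFree >= d.mem && w.coresFree >= d.cores) {  // same test as the source above
        w.memFree -= d.mem; w.coresFree -= d.cores
        placed = Some(pos)
      }
      pos = (pos + 1) % workers.size                       // advance even after a successful launch
      visited += 1
    }
    placed
  }
}

// With one worker of (250 MB, 1 core) and drivers A(100,1), B(200,1), C(300,2), only A is
// placed -- matching the worked example in the comments above.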
/** Register a worker node.
*
* Registering a WorkerInfo simply adds it to workers: HashSet[WorkerInfo] and updates the
* worker id -> worker and worker address -> worker mappings.
*
* On the Worker side, a successful registration is handled by:
* 1. Marking registration as successful.
* 2. Calling changeMaster to update activeMasterUrl, activeMasterWebUiUrl, master, masterAddress, etc.
* 3. Starting a periodic task that sends itself SendHeartbeat messages.
*/
~~~ # Source extract: Master.scala
~~~ # lines 408-471
private def registerWorker(worker: WorkerInfo): Boolean = {
// There may be one or more refs to dead workers on this same node (w/ different ID's); remove them.
// If dead worker entries remain at this host and port, drop them first
workers.filter { w =>
(w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
}.foreach { w =>
workers -= w
}
val workerAddress = worker.endpoint.address
if (addressToWorker.contains(workerAddress)) {
val oldWorker = addressToWorker(workerAddress)
if (oldWorker.state == WorkerState.UNKNOWN) {
// A worker registering from UNKNOWN implies that the worker was restarted during recovery.
// The old worker must thus be dead, so we will remove it and accept the new worker.
removeWorker(oldWorker)
} else {
logInfo("Attempted to re-register worker at same address: " + workerAddress)
return false
}
}
workers += worker
idToWorker(worker.id) = worker
addressToWorker(workerAddress) = worker
if (reverseProxy) {
webUi.addProxyTargets(worker.id, worker.webUiAddress)
}
true
}
private[deploy] object Master extends Logging {
val SYSTEM_NAME = "sparkMaster"
val ENDPOINT_NAME = "Master"
def main(argStrings: Array[String]) {
// Sets up common daemon diagnostics; should be called at the very start of main. On Unix-like systems it registers a signal handler to log incoming signals
Utils.initDaemon(log)
val conf = new SparkConf
// MasterArguments parses system environment variables and the command-line arguments passed when starting the Master
val args = new MasterArguments(argStrings, conf)
val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
rpcEnv.awaitTermination()
}
/**
* Start the Master and return a three tuple of:
* (1) The Master RpcEnv
* (2) The web UI bound port
* (3) The REST server bound port, if any
*/
def startRpcEnvAndEndpoint(
host: String,
port: Int,
webUiPort: Int,
conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
val securityMgr = new SecurityManager(conf)
// Create the RpcEnv, which mainly receives messages and hands them to the endpoint for processing
val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
// Send a message to the corresponding [[RpcEndpoint.receiveAndReply]] and wait for the result
// within the default timeout; an exception is thrown on failure.
// When the endpoint starts, a TransportServer is started by default, and after startup an
// askSync(BoundPortsRequest) round-trip verifies that RPC works (a toy model of this ask
// pattern follows the listing).
val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
(rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
}
}
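~~~ The askSync call above is a blocking request/response: send BoundPortsRequest and wait
~~~ for BoundPortsResponse within a timeout. A toy model of that pattern (simplified, not
~~~ Spark's RPC classes; the port values are just the standalone defaults):

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

case object BoundPortsRequest
case class BoundPortsResponse(webUIPort: Int, restPort: Option[Int])

def toyAskSync(handle: Any => Any, msg: Any, timeout: FiniteDuration): Any = {
  val p = Promise[Any]()
  p.trySuccess(handle(msg))        // in real RPC the reply arrives asynchronously over the wire
  Await.result(p.future, timeout)  // block the caller; throws TimeoutException if the reply is late
}

val reply = toyAskSync(
  { case BoundPortsRequest => BoundPortsResponse(8080, Some(6066)) },
  BoundPortsRequest,
  10.seconds)
println(reply)  // BoundPortsResponse(8080,Some(6066))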