|NO.Z.00077|——————————|BigDataEnd|——|Hadoop&Spark.V03|——|Spark.v03|Spark Principles & Source Code|Master & Worker Analysis and the Master Startup Flow|

1. Master startup flow
### --- Master startup flow

~~~     Master is an RpcEndpoint: it extends the ThreadSafeRpcEndpoint trait
~~~     Master's lifecycle follows the steps constructor -> onStart -> receive* -> onStop (a minimal sketch follows this list)
~~~     The most important work in Master's onStart method is performing recovery
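~~~     A minimal sketch of that lifecycle (illustrative only: LifecycleDemo is a made-up endpoint, and
~~~     Spark's rpc package is private[spark], so such code must live inside the org.apache.spark namespace):

package org.apache.spark.demo

import org.apache.spark.rpc.{RpcEnv, ThreadSafeRpcEndpoint}

class LifecycleDemo(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {
  println("1. constructor")                   // field initialization runs first, during construction

  override def onStart(): Unit =              // 2. invoked once the endpoint is registered with its RpcEnv
    println("2. onStart")

  override def receive: PartialFunction[Any, Unit] = {
    case msg => println(s"3. received $msg")  // 3. message loop, runs until the endpoint stops
  }

  override def onStop(): Unit =               // 4. invoked when the endpoint is unregistered
    println("4. onStop")
}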
### --- How Master HA can be implemented:

~~~     ZOOKEEPER: ZooKeeper-based Active/Standby mode (see the config sketch after this list).
~~~     Suited to production; the basic principle is that ZooKeeper elects one Master while the remaining Masters stay in Standby state.
~~~     FILESYSTEM: filesystem-based single-point recovery, mainly used in development or test environments.
~~~     Spark is given a directory in which to save the registration info of Spark Applications and Workers; once the Master fails,
~~~     restarting the Master process (start-master.sh) restores the registered Spark Applications and Workers.
~~~     CUSTOM: lets users supply their own HA implementation, which is especially useful for advanced users (a factory sketch appears after onStart in the excerpt below).
~~~     _ (the default, no HA configured): cluster data is not persisted, and the Master manages the cluster immediately on startup.
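~~~     For example, ZooKeeper-based HA is driven by three properties (in practice passed to the Master daemon
~~~     through SPARK_DAEMON_JAVA_OPTS in spark-env.sh; shown here as a SparkConf sketch with hypothetical hosts):

import org.apache.spark.SparkConf

// ZooKeeper-based Master HA; the keys below are the real Spark config properties.
val conf = new SparkConf()
  .set("spark.deploy.recoveryMode", "ZOOKEEPER")
  .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181") // hypothetical ensemble address
  .set("spark.deploy.zookeeper.dir", "/spark")                     // znode path for the Master's state

// FILESYSTEM-based single-point recovery would instead use:
//   spark.deploy.recoveryMode      = FILESYSTEM
//   spark.deploy.recoveryDirectory = /shared/spark-recovery   (e.g. an NFS mount)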
2. Source code walkthrough
### --- Source excerpt: Master.scala

~~~     # Source excerpt: Master.scala
~~~     # lines 18-111
package org.apache.spark.deploy.master

  private[deploy] class Master(
                                override val rpcEnv: RpcEnv,
                                address: RpcAddress,
                                webUiPort: Int,
                                val securityMgr: SecurityManager,
                                val conf: SparkConf)
    extends ThreadSafeRpcEndpoint with Logging with LeaderElectable {

    private val forwardMessageThread =
      ThreadUtils.newDaemonSingleThreadScheduledExecutor("master-forward-message-thread")

    // Build a Hadoop configuration from the Spark conf
    private val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)

    // For application IDs: timestamp format used when generating application IDs
    private def createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)

    private val WORKER_TIMEOUT_MS = conf.getLong("spark.worker.timeout", 60) * 1000
    private val RETAINED_APPLICATIONS = conf.getInt("spark.deploy.retainedApplications", 200)
    private val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
    private val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
    // To let Spark itself handle election and failure recovery, set spark.deploy.recoveryMode.
    // Two built-in modes: ZOOKEEPER and FILESYSTEM.
    // ZOOKEEPER: requires spark.deploy.zookeeper.url ==> the address of the ZooKeeper ensemble
    //   and spark.deploy.zookeeper.dir ==> the path under which Spark registers its znodes
    // FILESYSTEM: filesystem-based failure recovery:
    //   spark.deploy.recoveryMode ==> FILESYSTEM
    //   spark.deploy.recoveryDirectory ==> directory holding the Master's state for recovery; combined with a
    //   shared filesystem such as NFS, another Master process can take over when the current one dies.
    private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
    private val MAX_EXECUTOR_RETRIES = conf.getInt("spark.deploy.maxExecutorRetries", 10)

    val workers = new HashSet[WorkerInfo]
    val idToApp = new HashMap[String, ApplicationInfo]
    private val waitingApps = new ArrayBuffer[ApplicationInfo]
    val apps = new HashSet[ApplicationInfo]

    private val idToWorker = new HashMap[String, WorkerInfo]
    private val addressToWorker = new HashMap[RpcAddress, WorkerInfo]

    private val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo]
    private val addressToApp = new HashMap[RpcAddress, ApplicationInfo]
    private val completedApps = new ArrayBuffer[ApplicationInfo]
    private var nextAppNumber = 0

    private val drivers = new HashSet[DriverInfo]
    private val completedDrivers = new ArrayBuffer[DriverInfo]
    // Drivers currently spooled for scheduling
    private val waitingDrivers = new ArrayBuffer[DriverInfo]
    private var nextDriverNumber = 0

    Utils.checkHost(address.host)

    /*
    Spark metrics system (MetricsSystem)
    A MetricsSystem is created for a given instance and is composed of sources and sinks; it periodically pulls metrics from the sources and pushes them to the sinks. Instance, source and sink mean:
    Instance: who is using the metrics system. Spark has roles such as master, worker, executor and client driver, each of which creates a metrics system to monitor Spark's state; the instances implemented so far are master, worker, executor, driver and applications.
    Source: where metrics are collected from. There are two kinds of source:
    (1) internal Spark sources such as MasterSource and WorkerSource, which collect the internal state of Spark components
    (2) common sources such as JvmSource, configured via the metrics configuration file
    Sink: where metrics are written to
    */
    private val masterMetricsSystem = MetricsSystem.createMetricsSystem("master", conf, securityMgr)
    // metrics system for applications
    private val applicationMetricsSystem = MetricsSystem.createMetricsSystem("applications", conf, securityMgr)
    private val masterSource = new MasterSource(this)

    // After onStart, webUi will be set
    private var webUi: MasterWebUI = null

    private val masterPublicAddress = {
      val envVar = conf.getenv("SPARK_PUBLIC_DNS")
      if (envVar != null) envVar else address.host
    }

    private val masterUrl = address.toSparkURL
    private var masterWebUiUrl: String = _

    private var state = RecoveryState.STANDBY

    // Persistence engine: persists state when Master/Worker nodes are created or removed, used for failure recovery
    private var persistenceEngine: PersistenceEngine = _

    // The main interface (trait) for the leader election agent
    private var leaderElectionAgent: LeaderElectionAgent = _

    private var recoveryCompletionTask: ScheduledFuture[_] = _

    private var checkForWorkerTimeOutTask: ScheduledFuture[_] = _

    // As a temporary workaround before better ways of configuring memory, we allow users to set
    // a flag that will perform round-robin scheduling across the nodes (spreading out each app
    // among all the nodes) instead of trying to consolidate each app onto a small # of nodes.
    private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)

    // Default maxCores for applications that don't specify it (i.e. pass Int.MaxValue)
    // default number of cores (spark.deploy.defaultCores) handed to apps in standalone deployment
    private val defaultCores = conf.getInt("spark.deploy.defaultCores", Int.MaxValue)
    val reverseProxy = conf.getBoolean("spark.ui.reverseProxy", false)
    if (defaultCores < 1) {
      throw new SparkException("spark.deploy.defaultCores must be positive")
    }

    // Alternative application submission gateway that is stable across Spark versions
    private val restServerEnabled = conf.getBoolean("spark.master.rest.enabled", false)
    private var restServer: Option[StandaloneRestServer] = None
    private var restServerBoundPort: Option[Int] = None
~~~     # Source excerpt: Master.scala
~~~     # lines 112-214

    {
      val authKey = SecurityManager.SPARK_AUTH_SECRET_CONF
      require(conf.getOption(authKey).isEmpty || !restServerEnabled,
        s"The RestSubmissionServer does not support authentication via ${authKey}.  Either turn " +
          "off the RestSubmissionServer with spark.master.rest.enabled=false, or do not use " +
          "authentication.")
    }

    override def onStart(): Unit = {
      logInfo("Starting Spark master at " + masterUrl)
      logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")

      /** Create the Master web UI */
      webUi = new MasterWebUI(this, webUiPort)
      webUi.bind()
      masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort

      if (reverseProxy) {
        masterWebUiUrl = conf.get("spark.ui.reverseProxyUrl", masterWebUiUrl)
        webUi.addProxy()
        logInfo(s"Spark Master is acting as a reverse proxy. Master, Workers and " +
          s"Applications UIs are available at $masterWebUiUrl")
      }
      /** Scheduled task: WORKER_TIMEOUT_MS = 60 * 1000 ms, i.e. every 60 seconds check whether any worker has timed out */
      checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          // self is this Master's own endpointRef: periodically send CheckForWorkerTimeOut to the masterEndpoint,
          // which checks for and removes workers that have not sent a heartbeat in time
          self.send(CheckForWorkerTimeOut)
        }
      }, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)

      // defaults to false (spark.master.rest.enabled)
      if (restServerEnabled) {
        val port = conf.getInt("spark.master.rest.port", 6066)
        restServer = Some(new StandaloneRestServer(address.host, port, conf, self, masterUrl))
      }
      restServerBoundPort = restServer.map(_.start())

      // Register the master metrics source; metric names take the form <app ID>.<executor ID (or "driver")>.<source name>
      masterMetricsSystem.registerSource(masterSource)
      // Start the master's MetricsSystem (it can be started only once)
      masterMetricsSystem.start()
      // Start the applications metrics system
      applicationMetricsSystem.start()
      // Attach the master and app metrics servlet handler to the web ui after the metrics systems are
      // started.
      masterMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)
      applicationMetricsSystem.getServletHandlers.foreach(webUi.attachHandler)

      // Everything above wires up the metrics systems for the master and the applications
      // Java serialization framework
      val serializer = new JavaSerializer(conf)
      // Persistence engine: persists state when Master/Worker nodes are created or removed, used for failure recovery
      val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
        case "ZOOKEEPER" =>
          logInfo("Persisting recovery state to ZooKeeper")
          val zkFactory =
            new ZooKeeperRecoveryModeFactory(conf, serializer)
          (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
        case "FILESYSTEM" =>
          val fsFactory =
            new FileSystemRecoveryModeFactory(conf, serializer)
          (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
        case "CUSTOM" =>
          val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
          val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
            .newInstance(conf, serializer)
            .asInstanceOf[StandaloneRecoveryModeFactory]
          (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
        case _ =>
          (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
      }
      persistenceEngine = persistenceEngine_
      // For ZOOKEEPER mode, leader failover is driven by the Curator framework inside ZooKeeperLeaderElectionAgent
      leaderElectionAgent = leaderElectionAgent_
    }
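~~~     # The CUSTOM branch above loads a user-supplied factory via reflection. A hedged, standalone sketch of
~~~     # such a factory (MyRecoveryModeFactory is a hypothetical name; the reused types are package-private,
~~~     # so the class must live in org.apache.spark.deploy.master):

package org.apache.spark.deploy.master

import org.apache.spark.SparkConf
import org.apache.spark.serializer.Serializer

class MyRecoveryModeFactory(conf: SparkConf, serializer: Serializer)
  extends StandaloneRecoveryModeFactory(conf, serializer) {

  // A real implementation would persist ApplicationInfo/DriverInfo/WorkerInfo to durable shared
  // storage; the no-op engine is reused here only to keep the sketch short.
  override def createPersistenceEngine(): PersistenceEngine =
    new BlackHolePersistenceEngine()

  // Single-candidate "election": immediately notify the only Master that it is the leader.
  override def createLeaderElectionAgent(master: LeaderElectable): LeaderElectionAgent =
    new MonarchyLeaderAgent(master)
}

~~~     # Enabled with spark.deploy.recoveryMode=CUSTOM and spark.deploy.recoveryMode.factory set to the factory's class name.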

    override def onStop() {
      ... ...
    }

    override def electedLeader() {
      ... ...
    }

    override def revokedLeadership() {
      ... ...
    }
    /**
     * Master's message handling.
     * On receiving an ElectedLeader message the Master runs the election logic. In local-cluster mode there is only
     * one Master, so persistenceEngine holds no persisted App/Driver/Worker state and the current Master immediately
     * becomes ALIVE. This is where the Master's failure recovery comes in.
     */
    override def receive: PartialFunction[Any, Unit] = {
      case ElectedLeader => ...

      case CompleteRecovery => completeRecovery()

      case RevokedLeadership =>
        logError("Leadership has been revoked -- master shutting down.")
        System.exit(0)
~~~     # Source excerpt: Master.scala
~~~     # lines 215-302

      /** Wait for workers to register.
       * Steps the Master takes on a RegisterWorker message:
       * 1. Create a WorkerInfo.
       * 2. Register the WorkerInfo.
       * 3. Send a RegisteredWorker message back to the Worker to confirm registration.
       * 4. Call schedule() to perform resource scheduling.
       */

      case RegisterWorker(
      id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
        logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
          workerHost, workerPort, cores, Utils.megabytesToString(memory)))
        // If this Master is not active, do nothing and return
        // If this worker id has already been registered, return an error to the sender
        // (a STANDBY Master may still be in the middle of recovery)
        if (state == RecoveryState.STANDBY) {
          workerRef.send(MasterInStandby)
          // workerRef.send() ==> RpcEndpointRef.send() ==> NettyRpcEnv.send()
        } else if (idToWorker.contains(id)) {
          workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
          // duplicate worker id
          // otherwise the worker node is not in recovery
        } else {
          val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
            workerRef, workerWebUiUrl)
          /** registerWorker(worker): registering the WorkerInfo just means adding it to workers: HashSet[WorkerInfo],
           * and updating the worker id -> worker and worker address -> worker mappings
           */
          if (registerWorker(worker)) {
            // If the worker registered successfully, persist its info so it can be recovered later
            persistenceEngine.addWorker(worker)
            // The response is handled by Worker.handleRegisterResponse(msg: RegisterWorkerResponse)
            workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
            schedule()
          } else {
            // If worker registration failed, send a message back to the sender
            val workerAddress = worker.endpoint.address
            logWarning("Worker registration failed. Attempted to re-register worker at same " +
              "address: " + workerAddress)
            workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
              + workerAddress))
          }
        }

      // Register an application (App)
      // If this Master is standby, ignore the message; otherwise register the application, reply, and start scheduling
      case RegisterApplication(description, driver) =>
        // TODO Prevent repeated registrations from some driver
        // if this app was registered before
        if (state == RecoveryState.STANDBY) {
          // ignore, don't send response
          // A standby Master ignores registration messages; if the AppClient sent its request to a standby Master,
          // the timeout mechanism (20 seconds by default) triggers and the request is retried
        } else {
          logInfo("Registering app " + description.name)
          // Create the app
          // createApplication builds the ApplicationInfo: creation time, app ID, description, driver ref and defaultCores
          val app = createApplication(description, driver)
          // Save the application's info, mainly by adding it to the waitingApps list for later scheduling
          registerApplication(app)
          logInfo("Registered app " + description.name + " with ID " + app.id)
          // Persist this app's info (to ZK under ZOOKEEPER recovery mode)
          persistenceEngine.addApplication(app)
          driver.send(RegisteredApplication(app.id, self))
          // Schedule drivers and apps
          schedule()
        }

      case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>

      case DriverStateChanged(driverId, state, exception) =>

      case Heartbeat(workerId, worker) =>
        idToWorker.get(workerId) match {
          // a heartbeat arrived from a known worker
          case Some(workerInfo) =>
            // record the worker's last heartbeat time
            workerInfo.lastHeartbeat = System.currentTimeMillis()
          case None =>
            if (workers.map(_.id).contains(workerId)) {
              logWarning(s"Got heartbeat from unregistered worker $workerId." +
                " Asking it to re-register.")
              worker.send(ReconnectWorker(masterUrl))
            } else {
              logWarning(s"Got heartbeat from unregistered worker $workerId." +
                " This worker was never registered, so ignoring the heartbeat.")
            }
        }
~~~     # Source excerpt: Master.scala
~~~     # lines 303-407

      // Set the app with this appId to WAITING, preparing for the Master switch-over.
      case MasterChangeAcknowledged(appId) =>  ...
      // Response carrying the worker's scheduler state:
      // look up the worker by workerId and set its state to ALIVE,
      // keep the executors whose appId maps to a known app, attach those executors to their apps,
      // then record those apps on the worker. Think of it as the Worker's recovery on the Master side.
      case WorkerSchedulerStateResponse(workerId, executors, driverIds) => ... ...

      // the worker's latest state
      case WorkerLatestState(workerId, executors, driverIds) => ... ...

      // unregister an application
      case UnregisterApplication(applicationId) => ... ...
      case CheckForWorkerTimeOut =>
        // check for and remove workers whose heartbeats have timed out
        timeOutDeadWorkers()
    }

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case RequestSubmitDriver(description) => ... ...

      // request to kill a driver
      case RequestKillDriver(driverId) => ... ...

      // request driver status
      // look up the requested driver and return its status if found
      case RequestDriverStatus(driverId) => ... ...

      // request the master's state
      // reply to the sender with the master's state
      case RequestMasterState => ... ...
    }

    // a remote address was disconnected from the master
    override def onDisconnected(address: RpcAddress): Unit = ... ...
    /**
     * Schedule the currently available resources among waiting apps. This method will be called
     * every time a new app joins or resource availability changes.
     *
     * Two strategies exist for choosing Workers (Executors) when allocating resources to an Application:
     * 1. Spread out: distribute an Application across as many nodes as possible; controlled by spark.deploy.spreadOut (default true).
     * 2. Consolidate: pack an Application onto as few nodes as possible; suits CPU-intensive Applications with modest memory needs.
     */
    private def schedule(): Unit = {
      if (state != RecoveryState.ALIVE) {
        return
      }
      // Get the workers that are in ALIVE state, in random order
      val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
      val numWorkersAlive = shuffledAliveWorkers.size
      var curPos = 0
      for (driver <- waitingDrivers.toList) {
        // With standalone client-mode submission, the driver runs on the node that submitted it;
        // with cluster-mode submission, the Master schedules which node runs the driver.
        // iterate over a copy of waitingDrivers
        // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
        // start from the last worker that was assigned a driver, and continue onwards until we have
        // explored all alive workers.
        var launched = false
        var numWorkersVisited = 0
        // The loop below walks the waiting drivers. For example:
        // A submitted a program (needs 100 MB of memory, 1 core)
        // B submitted a program (needs 200 MB of memory, 1 core)
        // C submitted a program (needs 300 MB of memory, 2 cores)
        // All three registered successfully but are waiting because resources are short.
        // We then loop over the workers: any ALIVE worker with capacity left is tried as a host for the driver.
        while (numWorkersVisited < numWorkersAlive && !launched) {
          val worker = shuffledAliveWorkers(curPos)
          numWorkersVisited += 1
          // Launch only if the worker's free memory >= the driver's requested memory and its free cores >= the requested cores.
          // Say a worker has 250 MB free and 1 free core: only A and B qualify; first come first served, so A runs there,
          // leaving 150 MB and 0 cores free, and B and C keep waiting. (A standalone sketch of this loop follows the method.)
          if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores)
          {
            /**
             * Launch the driver program on this worker
             */
            launchDriver(worker, driver)
            waitingDrivers -= driver
            launched = true
          }
          curPos = (curPos + 1) % numWorkersAlive
        }
      }
      /**
       * After the driver loop above, launch executors:
       * startExecutorsOnWorkers allocates resources to the waiting apps
       */
      startExecutorsOnWorkers()
    }
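~~~     # The worked example in the comments above (drivers A/B/C and one 250 MB / 1-core worker) can be
~~~     # reproduced with this self-contained sketch of the same round-robin loop; DriverReq and WorkerSlot
~~~     # are hypothetical stand-ins for DriverInfo/WorkerInfo:

case class DriverReq(name: String, mem: Int, cores: Int)
case class WorkerSlot(name: String, var memFree: Int, var coresFree: Int)

val workers = Vector(WorkerSlot("w1", memFree = 250, coresFree = 1))
var waitingDrivers = List(DriverReq("A", 100, 1), DriverReq("B", 200, 1), DriverReq("C", 300, 2))

var curPos = 0
for (driver <- waitingDrivers) {   // iterates the initial list, i.e. a copy of the waiting queue
  var launched = false
  var numVisited = 0
  while (numVisited < workers.size && !launched) {
    val w = workers(curPos)
    numVisited += 1
    if (w.memFree >= driver.mem && w.coresFree >= driver.cores) {
      w.memFree -= driver.mem      // "launch": reserve the worker's resources
      w.coresFree -= driver.cores
      waitingDrivers = waitingDrivers.filterNot(_ == driver)
      launched = true
      println(s"launched driver ${driver.name} on ${w.name}")
    }
    curPos = (curPos + 1) % workers.size
  }
}
// Only A launches (w1 is left with 150 MB and 0 cores); B and C stay in waitingDrivers.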
    ========================
    /** Register a worker node.
     *
     * Registering a WorkerInfo just means adding it to workers: HashSet[WorkerInfo] and updating
     * the worker id -> worker and worker address -> worker mappings.
     *
     * On the Worker side, the registration response is then handled by:
     * 1. marking registration as successful
     * 2. calling changeMaster to update activeMasterUrl, activeMasterWebUiUrl, master, masterAddress, etc.
     * 3. starting a scheduled task that sends itself SendHeartbeat messages
     */
~~~     # Source excerpt: Master.scala
~~~     # lines 408-471

    private def registerWorker(worker: WorkerInfo): Boolean = {
      // There may be one or more refs to dead workers on this same node (w/ different ID's), remove them.
      // If dead workers are registered at this host and port, drop them first
      workers.filter { w =>
        (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
      }.foreach { w =>
        workers -= w
      }
      val workerAddress = worker.endpoint.address
      if (addressToWorker.contains(workerAddress)) {
        val oldWorker = addressToWorker(workerAddress)
        if (oldWorker.state == WorkerState.UNKNOWN) {
          // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
          // The old worker must thus be dead, so we remove it and accept the new worker.
          removeWorker(oldWorker)
        } else {
          logInfo("Attempted to re-register worker at same address: " + workerAddress)
          return false
        }
      }
      workers += worker
      idToWorker(worker.id) = worker
      addressToWorker(workerAddress) = worker
      if (reverseProxy) {
        webUi.addProxyTargets(worker.id, worker.webUiAddress)
      }
      true
    }

  private[deploy] object Master extends Logging {
    val SYSTEM_NAME = "sparkMaster"
    val ENDPOINT_NAME = "Master"
    def main(argStrings: Array[String]) {
      // Sets up common diagnostic state; should be called before anything else in main. On Unix-like systems it registers a signal handler that logs signals.
      Utils.initDaemon(log)
      val conf = new SparkConf
      // MasterArguments parses the system environment variables and the command-line arguments passed when starting the Master
      val args = new MasterArguments(argStrings, conf)
      val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
      rpcEnv.awaitTermination()
    }

    /**
     * Start the Master and return a three tuple of:
     * (1) The Master RpcEnv
     * (2) The web UI bound port
     * (3) The REST server bound port, if any
     */
    def startRpcEnvAndEndpoint(
                                host: String,
                                port: Int,
                                webUiPort: Int,
                                conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
      val securityMgr = new SecurityManager(conf)
      // Create the RpcEnv: it receives messages and dispatches them to the registered endpoint
      val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
      val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
        new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
      // Send a message to the endpoint's RpcEndpoint.receiveAndReply and wait for the result within the
      // default timeout, throwing an exception on failure. When the endpoint starts, a TransportServer is
      // brought up by default, and after startup an askSync(BoundPortsRequest) round-trip verifies RPC works.
      val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
      (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
    }
  }
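~~~     # A hedged usage sketch mirroring what start-master.sh ultimately triggers (the host and ports are
~~~     # illustrative; MasterArguments normally parses them from the command line, and the Master object is
~~~     # private[deploy], so this must run inside the org.apache.spark.deploy namespace):

import org.apache.spark.SparkConf
import org.apache.spark.deploy.master.Master

val conf = new SparkConf()
val (rpcEnv, webUiPort, restPort) =
  Master.startRpcEnvAndEndpoint("localhost", 7077, 8080, conf)
println(s"Master web UI bound to port $webUiPort, REST server: ${restPort.getOrElse("disabled")}")
rpcEnv.awaitTermination() // block until the Master's RpcEnv shuts down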
