Reading the Spark Source Code 03: Application Execution

Application Execution

1. Overview

The Driver thread mainly initializes the SparkContext object and prepares the runtime context. It then maintains an RPC connection with the ApplicationMaster, requesting resources through it, while at the same time scheduling tasks according to the user's business logic and dispatching them to idle Executors.

When the ResourceManager returns Container resources to the ApplicationMaster, the ApplicationMaster tries to start an Executor process on the corresponding Container. Once started, the Executor registers itself back with the Driver; after successful registration it keeps a heartbeat with the Driver while waiting for tasks, and reports the task status back to the Driver when a dispatched task finishes.

Through its transformations, a Spark RDD forms an RDD lineage (dependency) graph, i.e. a DAG; calling an action finally triggers a job and its scheduled execution. Two schedulers are involved in this process: DAGScheduler and TaskScheduler.

  • DAGScheduler is responsible for stage-level scheduling: it splits a job into stages and packages each stage into a TaskSet handed to the TaskScheduler.

  • TaskScheduler is responsible for task-level scheduling: it distributes the TaskSets received from the DAGScheduler to Executors according to the configured scheduling policy. During scheduling, the SchedulerBackend provides the available resources; SchedulerBackend has multiple implementations, each interfacing with a different resource management system.

While initializing the SparkContext, the Driver initializes the DAGScheduler, TaskScheduler, SchedulerBackend and HeartbeatReceiver, and starts the SchedulerBackend and the HeartbeatReceiver.

org.apache.spark.SparkContext

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

Params:
config – a Spark Config object describing the application configuration. Any settings in this config overrides the default configs as well as system properties.

SparkContext has several important fields, such as SparkConf, SparkEnv, SchedulerBackend, TaskScheduler and DAGScheduler:

private var _conf: SparkConf = _  // the configuration object
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _  // the environment object, e.g. the communication environment
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _  // scheduler backend, mainly for communicating with Executors
private var _taskScheduler: TaskScheduler = _  // task scheduler, for task-level scheduling
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _  // DAG scheduler, for stage division and task splitting
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _

The SchedulerBackend requests resources through the ApplicationMaster and continuously pulls suitable tasks from the TaskScheduler to dispatch to Executors.

The HeartbeatReceiver receives heartbeats from Executors, monitors their liveness, and notifies the TaskScheduler.


2. RDD Dependencies

Look at a transformation operator, for example map: it returns a MapPartitionsRDD. Stepping into it, we see that it extends RDD and that the first constructor argument is this, i.e. the current caller is wrapped inside:

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

Looking at the RDD constructor being used, we see that it passes in a OneToOneDependency:

def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))

OneToOneDependency extends NarrowDependency, and the rdd here is exactly the caller RDD:

/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the child RDD depends on a small number
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd  // the caller RDD, i.e. the parent RDD
}
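As a minimal illustration, the one-to-one mapping can be sketched with simplified stand-ins (the Mini* names are made up for this sketch, not Spark's actual classes):

```scala
// Simplified stand-ins for NarrowDependency / OneToOneDependency:
// each child partition depends on a small, fixed set of parent partitions.
abstract class MiniNarrowDependency {
  def getParents(partitionId: Int): Seq[Int]
}

// One-to-one: child partition i depends exactly on parent partition i.
class MiniOneToOneDependency extends MiniNarrowDependency {
  override def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}
```

So for a map over a 4-partition RDD, partition 2 of the child reads only partition 2 of the parent, which is why narrow dependencies can be pipelined.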

Now look at the groupBy operator:

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}

Step into the groupByKey method, then into combineByKeyWithClassTag:

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

Inside, it checks the partitioner; if it does not match, a ShuffledRDD is created:

if (self.partitioner == Some(partitioner)) {
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
} else {
  new ShuffledRDD[K, V, C](self, partitioner)
    .setSerializer(serializer)
    .setAggregator(aggregator)
    .setMapSideCombine(mapSideCombine)
}
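The branch above reduces to a small check, sketched here with made-up names, assuming that equal partitioners mean the data is already laid out as requested:

```scala
// Hypothetical sketch of the partitioner check in combineByKeyWithClassTag:
// if the RDD's current partitioner equals the requested one, no shuffle is
// needed (the mapPartitions path); otherwise a ShuffledRDD is created.
case class MiniPartitioner(numPartitions: Int)

def needsShuffle(current: Option[MiniPartitioner], requested: MiniPartitioner): Boolean =
  !current.contains(requested)
```

needsShuffle(Some(MiniPartitioner(4)), MiniPartitioner(4)) is false, matching the self.partitioner == Some(partitioner) branch above.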

Clicking into ShuffledRDD, the second argument to the inherited RDD constructor is Nil. Is the dependency empty? Actually no. Scrolling down, there is a getDependencies method, through which ShuffledRDD obtains its ShuffleDependency; when creating it, the current caller prev, i.e. the parent RDD, is still passed in:

class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil)
...
override def getDependencies: Seq[Dependency[_]] = {
  val serializer = userSpecifiedSerializer.getOrElse {
    val serializerManager = SparkEnv.get.serializerManager
    if (mapSideCombine) {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
    } else {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
    }
  }
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}

3. Stage Division

The overall logic of stage-level scheduling in the DAGScheduler is shown in the figure below.

Look at an action operator, for example collect: it calls sc.runJob:

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

Step in and keep following runJob down into dagScheduler.runJob, which calls submitJob, i.e. submits the job:

val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)

Stepping into submitJob, we see it posts an event to an event queue:

private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging
	...
private[spark] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
	...
eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  SerializationUtils.clone(properties)))
	...
  // => clicking into post shows that eventProcessLoop is backed by an event queue
  def post(event: E): Unit = {
    eventQueue.put(event)
  }
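The post/onReceive pattern can be sketched like this (a simplified, single-threaded stand-in with made-up Mini* names; Spark drains the queue on a dedicated "dag-scheduler-event-loop" thread):

```scala
import java.util.concurrent.LinkedBlockingDeque
import scala.collection.mutable.ArrayBuffer

// Simplified EventLoop sketch: post() only enqueues; a drain loop later
// calls onReceive for each queued event, in FIFO order.
sealed trait MiniEvent
case class MiniJobSubmitted(jobId: Int) extends MiniEvent

class MiniEventLoop {
  private val eventQueue = new LinkedBlockingDeque[MiniEvent]()
  val handledJobs = ArrayBuffer[Int]()

  def post(event: MiniEvent): Unit = eventQueue.put(event)

  private def onReceive(event: MiniEvent): Unit = event match {
    case MiniJobSubmitted(id) => handledJobs += id
  }

  // Spark runs this loop on its own thread; here we drain synchronously.
  def drain(): Unit = while (!eventQueue.isEmpty) onReceive(eventQueue.take())
}
```

The point of the indirection is that submitJob returns immediately; the actual stage division happens later, when the event is consumed.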

Find the onReceive method in DAGSchedulerEventProcessLoop: submitJob really just sends a message to the eventProcessLoop, and when the message is received, onReceive is called:

override def onReceive(event: DAGSchedulerEvent): Unit = {
	val timerContext = timer.time()
	try {
	  doOnReceive(event)
	} finally {
	  timerContext.stop()
	}
}

It then executes doOnReceive, which dispatches the job submission:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // submit the job
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

  case StageCancelled(stageId, reason) =>
    dagScheduler.handleStageCancellation(stageId, reason)

  case JobCancelled(jobId, reason) =>
    dagScheduler.handleJobCancellation(jobId, reason)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId, reason) =>
    val workerLost = reason match {
      case SlaveLost(_, true) => true
      case _ => false
    }
    dagScheduler.handleExecutorLost(execId, workerLost)

  case WorkerRemoved(workerId, host, message) =>
    dagScheduler.handleWorkerRemoved(workerId, host, message)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case SpeculativeTaskSubmitted(task) =>
    dagScheduler.handleSpeculativeTaskSubmitted(task)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

Step into dagScheduler.handleJobSubmitted; this is where the stage division happens:

finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
	//=> a new stage is created inside createResultStage
        val parents = getOrCreateParentStages(rdd, jobId)
			// =>
				// getShuffleDependencies: Returns shuffle dependencies that are immediate parents of the given RDD.
			    getShuffleDependencies(rdd).map { shuffleDep =>
                    // get or create the ShuffleMap stage, i.e. the shuffle-write side
      				getOrCreateShuffleMapStage(shuffleDep, firstJobId)
                      // => step into getOrCreateShuffleMapStage, then into createShuffleMapStage
                   			// iteratively creates earlier stages: with multiple shuffles it keeps walking backwards, creating a stage for each
                    	    val parents = getOrCreateParentStages(rdd, jobId)
                            val id = nextStageId.getAndIncrement()
                    		// create the shuffle-write stage
                            val stage = new ShuffleMapStage(
                              id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
                                        }.toList

        val id = nextStageId.getAndIncrement()
        // create a new stage; rdd here is the RDD currently being processed
        val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
        stageIdToStage(id) = stage
        updateJobIdStageIdMaps(jobId, stage)

Step into getShuffleDependencies: it searches for ShuffleDependencies, and each one found is added as a parent dependency:

toVisit.dependencies.foreach {
  case shuffleDep: ShuffleDependency[_, _, _] =>
    parents += shuffleDep
  case dependency =>
    waitingForVisit.push(dependency.rdd)
}
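The traversal above can be sketched with toy types (the Mini* names are hypothetical, not Spark's classes): collect immediate shuffle dependencies, and keep descending through narrow ones:

```scala
// Toy model of getShuffleDependencies: a depth-first walk of the lineage
// that stops at the first shuffle found along each path.
case class MiniRdd(name: String, dependencies: List[MiniDep])
sealed trait MiniDep { def parent: MiniRdd }
case class MiniNarrow(parent: MiniRdd) extends MiniDep
case class MiniShuffle(parent: MiniRdd) extends MiniDep

def shuffleDependencies(rdd: MiniRdd): Set[MiniShuffle] = {
  var parents = Set.empty[MiniShuffle]
  var waitingForVisit = List(rdd)  // stand-in for Spark's explicit stack
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.head
    waitingForVisit = waitingForVisit.tail
    toVisit.dependencies.foreach {
      case s: MiniShuffle => parents += s  // immediate shuffle parent found
      case d              => waitingForVisit = d.parent :: waitingForVisit
    }
  }
  parents
}
```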

With multiple shuffles it walks backwards iteratively, creating a stage for each. So in Spark, the number of stages equals the number of shuffle dependencies + 1: the action at the end always adds one more (result) stage.
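In other words (a toy calculation, assuming a linear lineage):

```scala
// Toy stage count: one stage per shuffle dependency, plus the final
// ResultStage triggered by the action.
sealed trait Boundary
case object NarrowStep extends Boundary
case object ShuffleStep extends Boundary

def numStages(lineage: Seq[Boundary]): Int = lineage.count(_ == ShuffleStep) + 1
```

A pipeline like map -> groupByKey -> map therefore yields 2 stages: one ShuffleMapStage and one ResultStage.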


4. Task Splitting

The overall logic of task-level scheduling in the TaskScheduler is shown in the figure below.

Now let's go back to where the final stage is created:

finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)

Scrolling down, we see that once the stages are ready, a new job is created and the final stage is submitted:

val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
  job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))

val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
  SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
submitStage(finalStage)

Step into submitStage: it submits stages recursively until a stage has no parent, i.e. the very first stage is submitted first:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug(s"submitStage($stage (name=${stage.name};" +
      s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  
      logDebug("missing: " + missing)
      if (missing.isEmpty) {  // if there is no earlier stage
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {  // otherwise, submit the earlier stages first
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
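The recursion can be sketched as follows (simplified: the real submitStage parks the child in waitingStages and resubmits it after its parents complete, whereas this toy version simply submits parents first):

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of submitStage's parent-first recursion.
case class MiniStage(id: Int, parents: List[MiniStage])

val submissionOrder = ArrayBuffer[Int]()

def submitStage(stage: MiniStage): Unit = {
  val missing = stage.parents.filterNot(p => submissionOrder.contains(p.id))
  missing.foreach(submitStage)  // submit earlier stages first
  if (!submissionOrder.contains(stage.id)) submissionOrder += stage.id
}
```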

Click into the submitMissingTasks method and scroll down to the block where tasks are created:

val tasks: Seq[Task[_]] = try {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      // partitionsToCompute: the partition indices; tasks are split by the partition indices of the last RDD in each stage
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = partitions(id)
        stage.pendingPartitions += id
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
      }

    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
          stage.rdd.isBarrier())
      }
  }
} catch {
  case NonFatal(e) =>
    abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}

This involves the ShuffleMapStage class; the partition indices are actually obtained through its findMissingPartitions method:

/** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}

This leads to another class, org.apache.spark.MapOutputTrackerMaster, which ShuffleMapStage relies on to determine the tasks to create. Its doc comment reads:

Driver-side class that keeps track of the location of the map output of a stage.

The DAGScheduler uses this class to (de)register map output statuses and to look up statistics for performing locality-aware reduce task scheduling.

ShuffleMapStage uses this class for tracking available / missing outputs in order to determine which tasks need to be run.


5. Task Scheduling

After task splitting completes, the tasks are wrapped into a TaskSet:

if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run
  markStageAsFinished(stage, None)

  stage match {
    case stage: ShuffleMapStage =>
      logDebug(s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})")
      markMapStageJobsAsFinished(stage)
    case stage : ResultStage =>
      logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
  }
  submitWaitingChildStages(stage)
}

Step into the submitTasks method:

override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])

    // Mark all the existing TaskSetManagers of this stage as zombie, as we are adding a new one.
    // This is necessary to handle a corner case. Let's say a stage has 10 partitions and has 2
    // TaskSetManagers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10
    // and it completes. TSM2 finishes tasks for partition 1-9, and thinks he is still active
    // because partition 10 is not completed yet. However, DAGScheduler gets task completion
    // events for all the 10 partitions and thinks the stage is finished. If it's a shuffle stage
    // and somehow it has missing map outputs, then DAGScheduler will resubmit it and create a
    // TSM3 for it. As a stage can't have more than one active task set managers, we must mark
    // TSM2 as zombie (it actually is).
    stageTaskSets.foreach { case (_, ts) =>
      ts.isZombie = true
    }
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)  // add the TaskSetManager to the task pool

    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()  // trigger taking TaskSetManagers back out of the pool
}

We see it calls createTaskSetManager, which wraps the TaskSet in another layer, a TaskSetManager:

// Label as private[scheduler] to allow tests to swap in different task set managers if necessary
private[scheduler] def createTaskSetManager(
    taskSet: TaskSet,
    maxTaskFailures: Int): TaskSetManager = {
  new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt)
}

After obtaining the manager, it is added to the scheduler via schedulableBuilder.addTaskSetManager. Reading the source shows that schedulableBuilder is assigned according to the scheduling mode:

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
        s"$schedulingMode")
    }
  }
  schedulableBuilder.buildPools()
}

The schedulingMode here comes from a configuration property, defaulting to FIFO:

// default scheduler is FIFO
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
val schedulingMode: SchedulingMode =
  try {
    SchedulingMode.withName(schedulingModeConf.toUpperCase(Locale.ROOT))
  } catch {
    case e: java.util.NoSuchElementException =>
      throw new SparkException(s"Unrecognized $SCHEDULER_MODE_PROPERTY: $schedulingModeConf")
  }

Then go back into schedulableBuilder.addTaskSetManager: it puts the manager into a rootPool, which can be seen as a task pool holding multiple TaskSetManagers:

override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  rootPool.addSchedulable(manager)
}

Once put in, when is it taken out? Scrolling down to backend.reviveOffers() and stepping in, we see it sends a ReviveOffers message:

override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

Searching for the message handler, we find:

case ReviveOffers =>
  makeOffers()

Step into the makeOffers method: if there are tasks to run, this is where tasks are allocated and scheduled; once allocation completes, the tasks are launched:

// Make fake resource offers on all executors
private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = withLock {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map {
      case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
          Some(executorData.executorAddress.hostPort))
    }.toIndexedSeq
    scheduler.resourceOffers(workOffers)  // allocate tasks
  }
  if (!taskDescs.isEmpty) {
    launchTasks(taskDescs)  // launch the tasks
  }
}

The doc comment of resourceOffers says that tasks are distributed to the cluster's nodes in a round-robin fashion so that they are balanced across the cluster:

Called by cluster manager to offer resources on slaves. We respond by asking our active task sets for tasks in order of priority. We fill each node with tasks in a round-robin manner so that tasks are balanced across the cluster.
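A round-robin deal can be sketched as (toy function, hypothetical names):

```scala
// Toy round-robin offer: deal tasks across executors one at a time so the
// load is balanced across the cluster.
def roundRobinAssign(taskIds: Seq[Int], executors: Seq[String]): Map[String, Seq[Int]] =
  taskIds.zipWithIndex
    .groupBy { case (_, i) => executors(i % executors.size) }
    .map { case (exec, pairs) => exec -> pairs.map(_._1) }
```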

Scheduling then follows the TaskSetManager sequence returned by rootPool.getSortedTaskSetQueue:

override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
  val sortedSchedulableQueue =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  for (schedulable <- sortedSchedulableQueue) {
    sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
  }
  sortedTaskSetQueue
}

// => 
private val taskSetSchedulingAlgorithm: SchedulingAlgorithm = {
    schedulingMode match {
      case SchedulingMode.FAIR =>
        new FairSchedulingAlgorithm()
      case SchedulingMode.FIFO =>
        new FIFOSchedulingAlgorithm()
      case _ =>
        val msg = s"Unsupported scheduling mode: $schedulingMode. Use FAIR or FIFO instead."
        throw new IllegalArgumentException(msg)
    }
  }
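The FIFO comparator orders TaskSetManagers by job priority (essentially the jobId), breaking ties by stageId; a sketch with toy names:

```scala
// Toy version of FIFOSchedulingAlgorithm's comparator: earlier jobs run
// first; within a job, earlier stages run first.
case class MiniTaskSetManager(priority: Int, stageId: Int)

def fifoLessThan(s1: MiniTaskSetManager, s2: MiniTaskSetManager): Boolean =
  if (s1.priority != s2.priority) s1.priority < s2.priority
  else s1.stageId < s2.stageId
```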

When sending task sets round-robin across cluster nodes, there is also the concept of locality levels: "move the computation rather than the data".

Locality levels:

  • Process-local (PROCESS_LOCAL): data and computation are in the same process (most efficient)

  • Node-local (NODE_LOCAL): data and computation are on the same node

  • Rack-local (RACK_LOCAL): data and computation are in the same rack

  • Any (ANY)

In special cases the locality level can be downgraded when allocating tasks:

// Take each TaskSet in our scheduling order, and then offer it each node in increasing order
// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
for (taskSet <- sortedTaskSets) {
  val availableSlots = availableCpus.map(c => c / CPUS_PER_TASK).sum
  // Skip the barrier taskSet if the available slots are less than the number of pending tasks.
  if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
    // Skip the launch process.
    // TODO SPARK-24819 If the job requires more slots than available (both busy and free
    // slots), fail the job on submit.
    logInfo(s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
      s"because the barrier taskSet requires ${taskSet.numTasks} slots, while the total " +
      s"number of available slots is $availableSlots.")
  } else {
    var launchedAnyTask = false
    // Record all the executor IDs assigned barrier tasks on.
    val addressesWithDescs = ArrayBuffer[(String, TaskDescription)]()
    for (currentMaxLocality <- taskSet.myLocalityLevels) {  // decide the locality level, i.e. where the task should go
      var launchedTaskAtCurrentMaxLocality = false
      do {
        launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
          currentMaxLocality, shuffledOffers, availableCpus, tasks, addressesWithDescs)
        launchedAnyTask |= launchedTaskAtCurrentMaxLocality
      } while (launchedTaskAtCurrentMaxLocality)
    }
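The fallback over locality levels can be sketched as: try the best level first, degrading only when nothing can be launched there (toy function; the real loop also tracks launch timeouts per level):

```scala
// Toy locality fallback, following the preferredLocality order noted in
// the snippet above.
val localityLevels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY")

// Return the best level at which at least one task can be launched.
def bestLaunchableLevel(canLaunchAt: String => Boolean): Option[String] =
  localityLevels.find(canLaunchAt)
```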

After the tasks are obtained, go back to launchTasks(taskDescs), which launches them. Stepping in, we see each task is serialized and sent to the corresponding Executor:

// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = TaskDescription.encode(task)
    if (serializedTask.limit() >= maxRpcMessageSize) {
      Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK

      logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")

      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask))) // send the task to the executor
    }
  }
}

6. Application Execution

When CoarseGrainedExecutorBackend's receive method gets the message, it decodes the task and launches it:

case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = TaskDescription.decode(data.value)  // deserialize (decode) the task
    logInfo("Got assigned task " + taskDesc.taskId)
    executor.launchTask(this, taskDesc)  // launch the task
  }

// => executor.launchTask: each incoming task gets its own thread to execute it
  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription)
    runningTasks.put(taskDescription.taskId, tr)
    threadPool.execute(tr)  // execute the task
  }
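The thread-per-task handoff can be sketched as follows (toy code; TaskRunner is replaced by a plain Runnable):

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue, Executors, TimeUnit}

// Toy Executor.launchTask: register a runner under the task id and hand it
// to a thread pool, one runner per task.
val runningTasks = new ConcurrentHashMap[Long, Runnable]()
val threadPool = Executors.newCachedThreadPool()
val finished = new ConcurrentLinkedQueue[Long]()

def launchTask(taskId: Long): Unit = {
  // Stand-in for new TaskRunner(context, taskDescription)
  val runner = new Runnable {
    override def run(): Unit = { finished.add(taskId); () }
  }
  runningTasks.put(taskId, runner)
  threadPool.execute(runner)  // run the task on a pool thread
}
```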

Then find the run() method of Executor's TaskRunner; scrolling down, the concrete task runs via task.run. Stepping in, we find it calls runTask, which turns out to be an abstract method: the task executes the concrete logic received from the Driver:

// Run the actual task and measure its runtime.
taskStartTime = System.currentTimeMillis()
taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime
} else 0L
var threwException = true
val value = Utils.tryWithSafeFinally {
  val res = task.run(
    taskAttemptId = taskId,
    attemptNumber = taskDescription.attemptNumber,
    metricsSystem = env.metricsSystem)
  threwException = false
  res
}
posted @ 2022-12-11 18:37 黄一洋