Spark Release Customization, Lesson 5: Walking Through the Runtime Source Code of the Spark Streaming Framework with a Case Study
How exactly does a Spark Streaming job run? Let's work through it with an example:
package com.dt.spark.streaming

import com.dt.spark.common.ConnectPool
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Using website hot-search-keyword ranking as an example, write the results to MySQL.
 * Created by dinglq on 2016/5/3.
 */
object WriteDataToMySQL {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WriteDataToMySQL")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Assume the data arriving on the socket has the format: searchKeyword,time
    val ItemsStream = ssc.socketTextStream("spark-master", 9999)
    // Turn each input line into (searchKeyword, 1)
    val ItemPairs = ItemsStream.map(line => (line.split(",")(0), 1))
    val ItemCount = ItemPairs.reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(60), Seconds(10))
    // ssc.checkpoint("/user/checkpoints/")
    // val ItemCount = ItemPairs.reduceByKeyAndWindow((v1:Int,v2:Int) => v1+v2, (v1:Int,v2:Int) => v1-v2, Seconds(60), Seconds(10))
    /**
     * Next we need to sort the keywords by frequency, but DStream provides no sort method.
     * We can use transform and sort with the RDD's sortByKey instead.
     */
    val hottestWord = ItemCount.transform(itemRDD => {
      val top3 = itemRDD.map(pair => (pair._2, pair._1))
        .sortByKey(false).map(pair => (pair._2, pair._1)).take(3)
      ssc.sparkContext.makeRDD(top3)
    })
    hottestWord.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        val conn = ConnectPool.getConnection
        conn.setAutoCommit(false)  // commit manually
        val stmt = conn.createStatement()
        partitionOfRecords.foreach(record => {
          stmt.addBatch("insert into searchKeyWord (insert_time,keyword,search_count) values (now(),'" + record._1 + "','" + record._2 + "')")
        })
        stmt.executeBatch()
        conn.commit()  // commit the transaction
      })
    })
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
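The example imports com.dt.spark.common.ConnectPool, a helper that is not shown in the post. A minimal sketch of what such a helper might look like, assuming a plain JDBC pool; the URL, user, and password below are placeholders, not values from the original project:

package com.dt.spark.common

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-in for the ConnectPool used in the example; the real class is not shown in the post.
object ConnectPool {
  // Placeholder JDBC settings -- replace with your own MySQL host, database, user, and password.
  // The MySQL JDBC driver jar must be on the classpath.
  private val url = "jdbc:mysql://localhost:3306/streaming"
  private val user = "root"
  private val password = "password"
  private val pool = new ConcurrentLinkedQueue[Connection]()

  Class.forName("com.mysql.jdbc.Driver")

  /** Return a pooled connection, or open a new one if the pool is empty. */
  def getConnection: Connection =
    Option(pool.poll()).getOrElse(DriverManager.getConnection(url, user, password))

  /** Give a connection back to the pool instead of closing it. */
  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}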
Submit the code to the Spark cluster and run it:
1. The program first initializes the StreamingContext
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
In the StreamingContext constructor a new SparkContext instance is created, which by itself shows that Spark Streaming runs on top of Spark Core.
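This layering is also visible in the public API: a StreamingContext can be created directly from an existing SparkContext. A small sketch (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingOnCore {
  def main(args: Array[String]): Unit = {
    // Spark Streaming sits on top of Spark Core: the StreamingContext reuses the
    // SparkContext it is given for every job it later submits.
    val sc  = new SparkContext(new SparkConf().setAppName("StreamingOnCore"))
    val ssc = new StreamingContext(sc, Seconds(5))
    assert(ssc.sparkContext eq sc)
    ssc.stop()
  }
}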
During its initialization, the StreamingContext does the following things.
2. Construct the DStreamGraph
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}
3. Construct the JobScheduler object
private[streaming] val scheduler = new JobScheduler(this)
During the initialization of the JobScheduler, the following objects are constructed: JobGenerator and StreamingListenerBus.
4. Construct the JobGenerator object (JobScheduler.scala, line 50)
private val jobGenerator = new JobGenerator(this)
5. When the JobGenerator is instantiated, it constructs a RecurringTimer (JobGenerator.scala, line 58)
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
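To make the timer's role concrete, here is a simplified stand-in (not Spark's actual RecurringTimer) that fires a callback at a fixed period; this is the mechanism by which a GenerateJobs event ends up being posted to the JobGenerator's event loop once every batchDuration:

import java.util.concurrent.{Executors, TimeUnit}

// Simplified sketch of the recurring-timer pattern used by JobGenerator (not Spark's code):
// every `periodMs` milliseconds the callback fires, the way Spark posts
// GenerateJobs(new Time(...)) to the JobGenerator event loop every batch interval.
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  private val executor = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    executor.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = callback(System.currentTimeMillis()) },
      0, periodMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = executor.shutdown()
}

// Usage sketch: simulate posting a "GenerateJobs" event every 5 seconds.
// new SimpleRecurringTimer(5000, t => println(s"GenerateJobs($t)")).start()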
6. Construct the StreamingListenerBus object (JobScheduler.scala, line 52)
val listenerBus = new StreamingListenerBus()
At this point, the instantiation of the StreamingContext is complete (only the main objects created are described above; it is not exhaustive, so please refer to the source code).
7. Define the input stream
val ItemsStream = ssc.socketTextStream("spark-master", 9999)
8. This method creates a SocketInputDStream
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}
The inheritance chain of SocketInputDStream is: SocketInputDStream -> ReceiverInputDStream -> InputDStream -> DStream.
9. During InputDStream construction, this SocketInputDStream is added to the DStreamGraph's inputStreams collection (InputDStream.scala, line 47)
ssc.graph.addInputStream(this)
In addition, the InputDStream and the DStreamGraph instance hold references to each other (DStreamGraph.scala, line 83)
def addInputStream(inputStream: InputDStream[_]) {
  this.synchronized {
    inputStream.setGraph(this)
    inputStreams += inputStream
  }
}
10. During the construction of ReceiverInputDStream, a ReceiverRateController is initialized
override protected[streaming] val rateController: Option[RateController] = {
  if (RateController.isBackPressureEnabled(ssc.conf)) {
    Some(new ReceiverRateController(id, RateEstimator.create(ssc.conf, ssc.graph.batchDuration)))
  } else {
    None
  }
}
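RateController.isBackPressureEnabled checks the spark.streaming.backpressure.enabled flag, which is off by default, so the ReceiverRateController above only exists when the application opts in. For example (a sketch; the application name is arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BackpressureExample {
  def main(args: Array[String]): Unit = {
    // Enabling backpressure lets the rate controller adjust each receiver's
    // ingestion rate based on how fast previous batches were processed.
    val conf = new SparkConf()
      .setAppName("BackpressureExample")
      .set("spark.streaming.backpressure.enabled", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    // ... define input streams and output operations here ...
    ssc.stop()
  }
}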
Note: the DStreamGraph also holds an outputStreams collection, which represents the output streams of the Spark Streaming program. Whenever output is required, operations such as print (which ultimately calls foreachRDD) and foreachRDD register the resulting DStream into outputStreams. (DStream.scala, line 684)
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Here, context is the StreamingContext.
11. Register the DStream with the DStreamGraph (DStream.scala, line 969)
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
So the whole business logic of a Spark Streaming program is the process of turning InputDStreams, through a series of transformations, into output DStreams.
12. Start the StreamingContext
ssc.start()
During startup, the state of the StreamingContext is checked. There are three states: INITIALIZED, ACTIVE, and STOPPED, and only a context in the INITIALIZED state is allowed to start. The code is as follows.
StreamingContext.scala, line 594
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
13. Call JobScheduler's start method (scheduler.start()).
JobScheduler.scala, line 62
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
14. In the code above, an EventLoop[JobSchedulerEvent] object is constructed first and its start method is called
eventLoop.start()
15. The JobScheduler's StreamingListenerBus is then registered to listen to each input stream's ReceiverRateController
for {
  inputDStream <- ssc.graph.getInputStreams
  rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)
StreamingContext.scala, line 536
def addStreamingListener(streamingListener: StreamingListener) {
  scheduler.listenerBus.addListener(streamingListener)
}
16. Call StreamingListenerBus's start method
listenerBus.start(ssc.sparkContext)
17. Instantiate receiverTracker and inputInfoTracker, and call receiverTracker's start method
receiverTracker = new ReceiverTracker(ssc)
inputInfoTracker = new InputInfoTracker(ssc)
receiverTracker.start()
18. Inside receiverTracker's start method, a ReceiverTrackerEndpoint object is constructed (ReceiverTracker.scala, line 149)
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
19. Get each InputDStream's receiver and start the receivers on the corresponding worker nodes. ReceiverTracker.scala, line 413
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}
20. The ReceiverTrackerEndpoint receives the StartAllReceivers message and handles it as follows.
ReceiverTracker.scala, line 449
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }
The receivers are started on the Executors; from this code we can also see that there can be more than one receiver.
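For example, an application ends up with multiple receivers simply by creating several receiver-based input streams and unioning them. A sketch with hypothetical host/port values; each socketTextStream below gets its own Receiver scheduled by the ReceiverTracker:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiReceiverSketch"), Seconds(5))
    // Each socketTextStream is its own ReceiverInputDStream, so each one gets a
    // dedicated Receiver running on some executor.
    val streams = (9999 to 10001).map(port => ssc.socketTextStream("spark-master", port))
    val merged = ssc.union(streams)  // combine the three streams into one DStream
    merged.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}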
21. Then, back in the step 13 code, jobGenerator.start() is called.
JobGenerator.scala, line 78
/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
22. It constructs an EventLoop[JobGeneratorEvent] and calls its start method
eventLoop.start()
23. It then checks whether the program should recover from checkpoint data at startup, choosing between restart and startFirstTime. Our code calls startFirstTime().
JobGenerator.scala, line 190
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}
24. Call DStreamGraph's start method
def start(time: Time) {
  this.synchronized {
    require(zeroTime == null, "DStream graph computation already started")
    zeroTime = time
    startTime = time
    outputStreams.foreach(_.initialize(zeroTime))
    outputStreams.foreach(_.remember(rememberDuration))
    outputStreams.foreach(_.validateAtStart)
    inputStreams.par.foreach(_.start())
  }
}
At this point, one would expect the InputDStreams to start and begin receiving data.
However, the start methods of InputDStream and ReceiverInputDStream are both empty.
InputDStream.scala, line 110
/** Method called to start receiving data. Subclasses must implement this method. */
def start()
ReceiverInputDStream.scala, line 63
// Nothing to start or stop as both taken care of by the ReceiverTracker.
def start() {}
And SocketInputDStream does not define a start method of its own, so
inputStreams.par.foreach(_.start())
actually does nothing at all. So how is the input stream triggered to start receiving data?
Let's look again at step 20 above:
startReceiver(receiver, executors)
The implementation is in ReceiverTracker.scala, line 545:
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
It wraps the Receiver in an RDD and submits it to the Spark cluster as a job. The second argument to submitJob is a function whose purpose is to start the receiver on the worker node:
val supervisor = new ReceiverSupervisorImpl(
  receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
supervisor.start()
supervisor.awaitTermination()
Inside supervisor.start, the following code is invoked.
ReceiverSupervisor.scala, line 127
/** Start the supervisor */
def start() {
  onStart()
  startReceiver()
}
The onStart() method is implemented in ReceiverSupervisorImpl (ReceiverSupervisorImpl.scala, line 172):
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
startReceiver looks like this:
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
It first calls the onReceiverStart method:
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
This sends a message to the ReceiverTrackerEndpoint, registering the Receiver with the ReceiverTracker.
Back in startReceiver, the receiver's onStart method is then called to start the receiver.
Note: it is important to distinguish ReceiverInputDStream from Receiver here. The Receiver is the component that actually receives the data, while ReceiverInputDStream is a layer of wrapping around the Receiver that turns the received data into a DStream.
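To make this division of labor concrete, here is a minimal custom Receiver sketch (following the standard Spark custom-receiver pattern, not code from this post): a Receiver only produces records and hands them to Spark via store(); turning the stored blocks into the RDDs of a DStream is the job of the ReceiverInputDStream and the ReceiverTracker.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver: it only produces data and calls store();
// block management and DStream/RDD creation happen in ReceiverInputDStream
// and the ReceiverTracker, not here.
class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns quickly, as the Receiver contract requires.
    new Thread("Counter Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(s"record-$i")  // hand one record to Spark Streaming
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to clean up; the receiving thread checks isStopped() and exits on its own.
  }
}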
In our example, the Receiver is obtained through SocketInputDStream's getReceiver method (which was called back in step 19).
SocketInputDStream.scala, line 42
def getReceiver(): Receiver[T] = {
  new SocketReceiver(host, port, bytesToObjects, storageLevel)
}
The SocketReceiver keeps pulling data from the socket.
Let's look at SocketReceiver's onStart and receive methods:
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

/** Create a socket connection and receive data until receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
At this point, our Receiver has been started and is receiving data. Note that the Receiver itself is launched as a Spark job.
25. How does Spark Streaming submit a job every batch interval?
Let's go back to step 22. During JobGenerator startup, an EventLoop[JobGeneratorEvent] was created and its start method was called. The code is as follows:
def start(): Unit = {
  if (stopped.get) {
    throw new IllegalStateException(name + " has already been stopped")
  }
  // Call onStart before starting the event thread to make sure it happens before onReceive
  onStart()
  eventThread.start()
}
It starts an eventThread, whose run method calls the EventLoop's onReceive method for every event it takes off the queue:
private val eventThread = new Thread(name) {
  setDaemon(true)

  override def run(): Unit = {
    try {
      while (!stopped.get) {
        val event = eventQueue.take()
        try {
          onReceive(event)
        } catch {
          case NonFatal(e) => {
            try {
              onError(e)
            } catch {
              case NonFatal(e) => logError("Unexpected error in " + name, e)
            }
          }
        }
      }
    } catch {
      case ie: InterruptedException => // exit even if eventQueue is not empty
      case NonFatal(e) => logError("Unexpected error in " + name, e)
    }
  }
}
And the onReceive method, as overridden in the JobGenerator, looks like this:
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
  override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

  override protected def onError(e: Throwable): Unit = {
    jobScheduler.reportError("Error in job generator", e)
  }
}
It calls the processEvent method:
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
Here we can see that jobs are generated for a given time. Recall from step 5 that the RecurringTimer posts a GenerateJobs(new Time(longTime)) event to this event loop once every batchDuration, so this branch fires once per batch interval:
case GenerateJobs(time) => generateJobs(time)
Continuing into generateJobs:
/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
graph.generateJobs is then called:
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
which in turn calls generateJob on each output stream:
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}
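The snippet above is the generic DStream.generateJob, whose job body just runs an empty function over the RDD. In our example the registered output stream is actually a ForEachDStream, whose override wraps the user's foreachFunc instead. Paraphrased from the 1.6-era source (not quoted verbatim), it looks roughly like this:

// ForEachDStream.generateJob, roughly: the job's body is the user's foreachRDD function
// applied to the parent DStream's RDD for this batch time.
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}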
Either way, getOrCompute is called first to generate the RDD for the given batch time:
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {

      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
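Once graph.generateJobs succeeds, generateJobs (shown earlier) wraps the jobs in a JobSet and passes them to JobScheduler.submitJobSet, which hands each Job to the jobExecutor thread pool (its size is controlled by spark.streaming.concurrentJobs, default 1). Paraphrased, not quoted verbatim from the source:

// JobScheduler.submitJobSet, roughly: record the batch and run each Job on the
// streaming-job-executor thread pool; each JobHandler eventually calls job.run(),
// which invokes the jobFunc built in generateJob above.
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}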
When a JobHandler runs its Job, the jobFunc executes and the work is finally submitted to the Spark cluster through SparkContext's runJob method. In our example, that means every 10 seconds (the slide interval) the foreachRDD function runs and writes the top 3 search keywords of the last 60 seconds into MySQL.

Source: http://lqding.blog.51cto.com/9123978/1771017
Notes:
1. DT Big Data Dream Factory WeChat public account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains