Spark Release Customization, Day 12: Executor Fault Tolerance and Data Safety
Topics in this installment:
1 Executor WAL
2 Message replay
3 Miscellaneous
Any data that cannot be processed as a real-time stream is, in effect, worthless. In the streaming era, Spark Streaming is enormously attractive and has broad prospects; combined with the rest of the Spark ecosystem, Streaming can easily tap into powerful frameworks such as SQL and MLlib, which positions it to dominate the field.
At runtime, Spark Streaming is less a streaming framework on top of Spark Core than the most complex application built on Spark Core. Master an application this complex, and no other application will pose a problem. That is also why Spark Streaming is the natural entry point for this customization series.
Data safety on the Executor side is critical. (Computation fault tolerance relies on Spark Core's own mechanisms and is therefore handled naturally; the fault tolerance discussed here concerns the safety of the data itself.)
There are two ways to write received data: WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler. Both guarantee that the data can be replayed. Note that if the write-ahead log is enabled but no checkpoint directory has been specified, an exception is thrown.
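The selection between the two handlers can be modeled with a short sketch. This is an illustrative Python rendering, not Spark's API: the function name `choose_block_handler` is hypothetical, though `spark.streaming.receiver.writeAheadLog.enable` is the real configuration key.

```python
from typing import Optional

def choose_block_handler(conf: dict, checkpoint_dir: Optional[str]) -> str:
    """Illustrative sketch: the WAL-based handler requires a checkpoint directory."""
    wal_enabled = conf.get(
        "spark.streaming.receiver.writeAheadLog.enable", "false") == "true"
    if not wal_enabled:
        return "BlockManagerBasedBlockHandler"
    if checkpoint_dir is None:
        # Spark throws an exception in this situation
        raise ValueError(
            "Cannot enable receiver write-ahead log without checkpoint directory set")
    return "WriteAheadLogBasedBlockHandler"
```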
Here we focus on the WAL approach, i.e., WriteAheadLogBasedBlockHandler:
private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}
Here, checkpointDirToLogDir builds the directory in which received data is stored: each receiver stream gets its own subdirectory named after its stream ID, under receivedData inside the checkpoint directory.
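The path construction above can be mimicked in one line (a Python sketch for illustration; the real code uses Hadoop's `Path` class, and `checkpoint_dir_to_log_dir` is just a stand-in name):

```python
import os

def checkpoint_dir_to_log_dir(checkpoint_dir: str, stream_id: int) -> str:
    # Mirrors checkpointDirToLogDir: each receiver stream gets its own
    # directory under <checkpointDir>/receivedData/<streamId>
    return os.path.join(checkpoint_dir, "receivedData", str(stream_id))
```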
def createLogForReceiver(
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration
  ): WriteAheadLog = {
  createLog(false, sparkConf, fileWalLogDirectory, fileWalHadoopConf)
}
This method simply delegates to createLog with isDriver = false:
private def createLog(
    isDriver: Boolean,
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration
  ): WriteAheadLog = {
  val classNameOption = if (isDriver) {
    sparkConf.getOption(DRIVER_WAL_CLASS_CONF_KEY)
  } else {
    sparkConf.getOption(RECEIVER_WAL_CLASS_CONF_KEY)
  }
  val wal = classNameOption.map { className =>
    try {
      instantiateClass(
        Utils.classForName(className).asInstanceOf[Class[_ <: WriteAheadLog]], sparkConf)
    } catch {
      case NonFatal(e) =>
        throw new SparkException(s"Could not create a write ahead log of class $className", e)
    }
  }.getOrElse {
    new FileBasedWriteAheadLog(sparkConf, fileWalLogDirectory, fileWalHadoopConf,
      getRollingIntervalSecs(sparkConf, isDriver), getMaxFailures(sparkConf, isDriver),
      shouldCloseFileAfterWrite(sparkConf, isDriver))
  }
  if (isBatchingEnabled(sparkConf, isDriver)) {
    new BatchedWriteAheadLog(wal, sparkConf)
  } else {
    wal
  }
}
If no custom WAL class is configured, a FileBasedWriteAheadLog is created by default; and when batching is enabled, the chosen WAL is additionally wrapped in a BatchedWriteAheadLog.
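The selection logic in createLog boils down to: configured class if present, file-based default otherwise, optionally wrapped for batching. A minimal Python model of that flow follows; the class names and the `custom_wals` registry are stand-ins for Spark's reflective `instantiateClass`, though the two configuration keys are the real ones behind DRIVER_WAL_CLASS_CONF_KEY and RECEIVER_WAL_CLASS_CONF_KEY.

```python
class FileBasedWAL:
    """Stand-in for FileBasedWriteAheadLog (the default implementation)."""

class BatchedWAL:
    """Stand-in for BatchedWriteAheadLog, which wraps another WAL."""
    def __init__(self, wrapped):
        self.wrapped = wrapped

def create_log(conf: dict, is_driver: bool, custom_wals: dict, batching: bool):
    # Driver and receiver read different configuration keys
    key = ("spark.streaming.driver.writeAheadLog.class" if is_driver
           else "spark.streaming.receiver.writeAheadLog.class")
    # Fall back to the file-based WAL when no custom class is configured
    cls = custom_wals.get(conf.get(key), FileBasedWAL)
    wal = cls()
    return BatchedWAL(wal) if batching else wal
```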
Back in storeBlock, two futures are created, storeInBlockManagerFuture and storeInWriteAheadLogFuture, so the writes to the BlockManager and to the WAL proceed in parallel. Once both complete, the block can be reported to the trackerEndpoint message loop.
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]
  // Serialize the block so that it can be inserted into both
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }
  // Store the block in block manager
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }
  // Store the block in write ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }
  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}
Note:
Material sourced from DT_大数据梦工厂 (Spark release customization series), based on Wang Jialin's (王家林) nightly public Spark course.