Spark Source Code Series - MapPartitionsRDD & ShuffledRDD
Conclusion
Both kinds of operators follow the wrapper pattern: the current RDD passes itself in, and a new RDD object is created that wraps it.
- Non-shuffle operators create a new MapPartitionsRDD on each call
- Shuffle operators create a new ShuffledRDD on each call
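The wrapper pattern above can be sketched without Spark at all. This is a hypothetical, minimal model (the names MiniRDD, SourceRDD, and MapStep are illustrative, not Spark classes): each operator call wraps the current node in a new one, building a lineage chain, and nothing is computed until the iterator is pulled.

```scala
// Hypothetical sketch of the wrapper pattern: each operator wraps `this`
// in a new node; MiniRDD / SourceRDD / MapStep are illustrative names only.
abstract class MiniRDD[T] {
  def iterator: Iterator[T]
  // Each call creates a new node that wraps the current one.
  def map[U](f: T => U): MiniRDD[U] = new MapStep(this, f)
}

class SourceRDD[T](data: Seq[T]) extends MiniRDD[T] {
  def iterator: Iterator[T] = data.iterator
}

class MapStep[U, T](prev: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  // Lazily apply f while pulling from the wrapped (parent) node.
  def iterator: Iterator[U] = prev.iterator.map(f)
}

val chain = new SourceRDD(Seq(1, 2, 3)).map(_ * 2).map(_ + 1)
println(chain.iterator.toList) // List(3, 5, 7)
```

Each `map` call returns a fresh object, so the original node is never mutated; this mirrors how every non-shuffle operator below produces a new MapPartitionsRDD.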
Non-shuffle operators
val words = lines.flatMap(_.split("\\s+"))
val wordPair = words.map((_, 1))
RDD -> flatMap
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  // Create a new MapPartitionsRDD that wraps the current RDD (this).
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}
RDD -> map
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}
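In both snippets the closure handed to MapPartitionsRDD takes three arguments: the TaskContext, the partition index, and the partition's iterator (the first two are ignored here, hence `(_, _, iter)`). The local sketch below shows how such a per-partition function is applied; the `PartitionFn` alias and the two-partition setup are illustrative, not Spark code.

```scala
// Sketch of applying a per-partition closure like the one map() builds.
// PartitionFn is an illustrative alias: (partitionIndex, iterator) => iterator.
type PartitionFn[T, U] = (Int, Iterator[T]) => Iterator[U]

// The closure built by map(): ignore the partition index, map the iterator.
def mapClosure[T, U](f: T => U): PartitionFn[T, U] =
  (_, iter) => iter.map(f)

// Pretend these two Seqs are the two partitions of one RDD.
val partitions = Seq(Seq(1, 2), Seq(3, 4))
val fn = mapClosure[Int, Int](_ * 10)

// Each partition is transformed independently, with no data movement
// across partitions -- which is why no shuffle is needed.
val result = partitions.zipWithIndex.map { case (p, idx) =>
  fn(idx, p.iterator).toList
}
println(result) // List(List(10, 20), List(30, 40))
```

Because the function only ever sees one partition's iterator, output partitioning matches input partitioning, and the operator stays narrow (no shuffle).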
Shuffle operators
val wordSum = wordPair.reduceByKey(_ + _)
val sortedResult = wordSum.sortBy(_._2)
PairRDDFunctions -> reduceByKey
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
PairRDDFunctions -> combineByKeyWithClassTag
def combineByKeyWithClassTag[C](
    ...): RDD[(K, C)] = ... {
  ...
  new ShuffledRDD[K, V, C](self, partitioner)
  ...
}
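Note how reduceByKey passes three functions to combineByKeyWithClassTag: the identity `(v: V) => v` as createCombiner, and the user's `func` as both mergeValue and mergeCombiners. Their roles can be shown with a local, single-machine analogue (the name `localCombineByKey` is illustrative; the real work happens across shuffle map and reduce sides):

```scala
// Local analogue of combineByKey semantics; localCombineByKey is an
// illustrative helper, not Spark API.
def localCombineByKey[K, V, C](
    records: Seq[(K, V)],
    createCombiner: V => C,        // first value seen for a key
    mergeValue: (C, V) => C        // fold further values into the combiner
): Map[K, C] =
  records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
    val c = acc.get(k) match {
      case Some(existing) => mergeValue(existing, v)
      case None           => createCombiner(v)
    }
    acc + (k -> c)
  }

// Same shape as reduceByKey(_ + _): identity combiner, func for merging.
val counts = localCombineByKey(
  Seq(("a", 1), ("b", 2), ("a", 3)),
  (v: Int) => v,
  (c: Int, v: Int) => c + v)
println(counts) // Map(a -> 4, b -> 2)
```

In Spark itself this aggregation is split across the shuffle boundary (map-side combine, then merge after the ShuffledRDD repartitions by key), but the three functions play exactly these roles.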
RDD -> sortBy
def sortBy[K](...): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}
OrderedRDDFunctions -> sortByKey
def sortByKey(...): RDD[(K, V)] = self.withScope {
  ...
  new ShuffledRDD...
}
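So sortBy is three chained steps: keyBy pairs each element with its sort key, sortByKey does the actual sorting (and is where the ShuffledRDD is created), and values drops the key again. A local analogue on a plain Seq makes the shape clear (`localSortBy` is an illustrative helper, not Spark API):

```scala
// Local analogue of sortBy's three steps; localSortBy is illustrative.
def localSortBy[T, K: Ordering](data: Seq[T], f: T => K): Seq[T] =
  data.map(x => (f(x), x)) // keyBy: pair each element with its sort key
      .sortBy(_._1)        // sortByKey: in Spark, this builds a ShuffledRDD
      .map(_._2)           // values: drop the key, keep the element

// Sort word counts by count, like wordSum.sortBy(_._2) in the example above.
val wordCounts = Seq(("spark", 3), ("rdd", 1), ("shuffle", 2))
val sorted = localSortBy(wordCounts, (p: (String, Int)) => p._2)
println(sorted) // List((rdd,1), (shuffle,2), (spark,3))
```

The shuffle is unavoidable here: a global sort needs every element's key compared against elements in other partitions, so sortByKey must repartition by key range before sorting within partitions.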