Spark Source Code Series - MapPartitionsRDD & ShuffledRDD

Conclusion

Both kinds of transformations follow the wrapper (decorator) pattern: the current RDD passes itself in as the parent, and a new RDD object is returned:

  • Non-shuffle operators: each call creates a new MapPartitionsRDD
  • Shuffle operators: each call creates a new ShuffledRDD
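The lineage that these wrappers build up can be inspected with `toDebugString`. A minimal sketch, assuming a local `SparkContext` (the object name `LineageDemo` and the sample data are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("lineage"))
    val lines    = sc.parallelize(Seq("a b", "b c"))
    val words    = lines.flatMap(_.split("\\s+")) // creates a MapPartitionsRDD
    val wordPair = words.map((_, 1))              // creates a MapPartitionsRDD
    val wordSum  = wordPair.reduceByKey(_ + _)    // creates a ShuffledRDD
    // toDebugString walks the parent references, so each operator call
    // shows up as the RDD it created.
    println(wordSum.toDebugString)
    sc.stop()
  }
}
```

The printed lineage shows one `MapPartitionsRDD` per non-shuffle call and a `ShuffledRDD` at the stage boundary introduced by `reduceByKey`.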

Non-shuffle operators

    val words = lines.flatMap(_.split("\\s+")) 
    val wordPair = words.map((_, 1)) 

RDD -> flatMap

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    // Wrap this RDD (the parent) in a new MapPartitionsRDD.
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
  }

RDD -> map

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
  }
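Note that `flatMap` and `map` differ only in the iterator function handed to `MapPartitionsRDD`; the wrapping itself is identical. The pattern can be sketched without Spark (all names here, such as `MiniRDD`, are made up for illustration; this is not Spark's actual class hierarchy):

```scala
// Illustrative sketch of the wrapper pattern used by MapPartitionsRDD.
abstract class MiniRDD[T](val parent: Option[MiniRDD[_]]) {
  def compute(): Iterator[T]
  // Each transformation returns a NEW wrapper holding `this` as parent.
  def map[U](f: T => U): MiniRDD[U] =
    new MapPartitionsMini[U, T](this, iter => iter.map(f))
  def flatMap[U](f: T => TraversableOnce[U]): MiniRDD[U] =
    new MapPartitionsMini[U, T](this, iter => iter.flatMap(f))
}

// A source RDD with no parent.
class SourceMini[T](data: Seq[T]) extends MiniRDD[T](None) {
  def compute(): Iterator[T] = data.iterator
}

// Analogue of MapPartitionsRDD: keeps the parent plus an
// iterator-to-iterator function, applied lazily at compute time.
class MapPartitionsMini[U, T](prev: MiniRDD[T], f: Iterator[T] => Iterator[U])
    extends MiniRDD[U](Some(prev)) {
  def compute(): Iterator[U] = f(prev.compute())
}
```

Every call allocates a fresh wrapper, exactly as the Spark methods above allocate a new `MapPartitionsRDD`; the chain of parent references is the lineage.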

Shuffle operators

    val wordSum = wordPair.reduceByKey(_ + _)
    val sortedResult = wordSum.sortBy(_._2)

PairRDDFunctions -> reduceByKey

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

PairRDDFunctions -> combineByKeyWithClassTag

  def combineByKeyWithClassTag[C](
      ...): RDD[(K, C)] = ... {
    ...
      // When a shuffle is needed, wrap self (the parent RDD) in a ShuffledRDD.
      new ShuffledRDD[K, V, C](self, partitioner)
        ...
    }
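Note how `reduceByKey` passes the same `func` as both the merge-value and merge-combiners arguments. Spelled out, `reduceByKey(_ + _)` on our `wordPair` is equivalent to the following explicit call (a sketch assuming `wordPair: RDD[(String, Int)]` from the earlier example):

```scala
// reduceByKey(_ + _) expressed via combineByKeyWithClassTag:
val wordSum = wordPair.combineByKeyWithClassTag[Int](
  (v: Int) => v,                   // createCombiner: first value seen for a key
  (c: Int, v: Int) => c + v,       // mergeValue: fold a value in (map side)
  (c1: Int, c2: Int) => c1 + c2)   // mergeCombiners: merge partial sums after the shuffle
```

Because addition is associative, one function can safely serve both merge roles, which is exactly what `reduceByKey` relies on.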

RDD -> sortBy

  def sortBy[K](...): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }
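So `sortBy` is sugar over the shuffle-based `sortByKey`; the three steps can be written out by hand (a sketch assuming `wordSum: RDD[(String, Int)]` from the earlier example):

```scala
// sortBy(_._2) desugars into keyBy / sortByKey / values:
val sortedResult = wordSum
  .keyBy(_._2)      // key each (word, count) pair by the sort field
  .sortByKey(true)  // ShuffledRDD: range-partitions and sorts by the key
  .values           // drop the temporary key, keeping (word, count)
```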

OrderedRDDFunctions -> sortByKey

  def sortByKey(...): RDD[(K, V)] = self.withScope {
    ...
    new ShuffledRDD...
  }
posted @ 2022-05-23 22:04  608088