Crushed - WuLei吴磊

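I asked Mr. An a question: "How do you end Spark remote debugging?" I expected a very simple answer, such as a single operation or a one-line command. Instead he dealt me a heavy blow: he ended my Spark process by working from first principles, from the bottom up.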

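He put on quite a show. At one point I cut in: "Mr. An, how do you use filter?" I expected an analogy; instead he walked me through a big chunk of Spark source code!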

Steps: trace spark-submit, see that it goes through spark-class, run jps, end the process.
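The jps step can be sketched roughly as follows. This is only a sketch, assuming the driver was started with spark-submit in client mode, so that its JVM shows up in jps under the main class name SparkSubmit. The object name FindSparkSubmit and the helper pidsFor are hypothetical names made up for illustration, and the kill step is left commented out on purpose.

```scala
import scala.sys.process._

object FindSparkSubmit {
  // Parse `jps` output (one "<pid> <main class>" pair per line) and return
  // the PIDs whose main-class column equals `mainClass`.
  // Lines that do not have exactly two columns are ignored.
  def pidsFor(jpsOutput: String, mainClass: String): Seq[Int] =
    jpsOutput.linesIterator
      .map(_.trim.split("\\s+"))
      .collect { case Array(pid, name) if name == mainClass => pid.toInt }
      .toList

  def main(args: Array[String]): Unit = {
    // Run `jps` and look for the JVM started by spark-submit (assumption:
    // it is listed as "SparkSubmit").
    val pids = pidsFor("jps".!!, "SparkSubmit")
    pids.foreach(pid => println(s"SparkSubmit JVM: $pid"))
    // Ending the process would then be a plain kill, for example:
    // pids.foreach(pid => s"kill $pid".!)
  }
}
```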

RDD filter source code:

filter

  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

filter is a filtering operation. For example, mapRDD.filter(_ > 1) keeps only the elements greater than 1.
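To get a feel for what the filtered RDD does without starting Spark, here is a minimal sketch in plain Scala. It assumes an RDD can be modeled as a list of partitions and that the predicate is applied to each partition's iterator independently, with no data moving between partitions. The names FilterSketch, Partitions and filterPartitions are hypothetical and are not Spark APIs.

```scala
object FilterSketch {
  // A toy stand-in for an RDD: just a sequence of partitions (not a Spark type).
  type Partitions[T] = Seq[Seq[T]]

  // Apply the predicate to each partition's iterator on its own.
  // The number of partitions stays the same; only their contents shrink.
  def filterPartitions[T](data: Partitions[T])(f: T => Boolean): Partitions[T] =
    data.map(part => part.iterator.filter(f).toList)

  def main(args: Array[String]): Unit = {
    val mapRDD: Partitions[Int] = Seq(Seq(0, 1, 2), Seq(3, 1), Seq(1))
    // Same predicate as mapRDD.filter(_ > 1) in the text above.
    println(filterPartitions(mapRDD)(_ > 1)) // List(List(2), List(3), List())
  }
}
```

Note that the last partition ends up empty rather than disappearing, which is one reason a filter is often followed by a repartitioning step in practice.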

distinct
Returns a new RDD containing the elements of this RDD with duplicates removed.

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
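The three-step pipeline in distinct (pair each element with null, reduce by key, keep only the key) can be tried out on a local collection. This is a sketch under the assumption that a plain Seq stands in for the RDD, so numPartitions has no counterpart here. The names DistinctSketch, reduceByKey and distinct below are local helpers written for illustration, not the Spark methods themselves.

```scala
object DistinctSketch {
  // Local analogue of reduceByKey: group pairs by key, then fold the
  // values of each key together with `f`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Mirrors map(x => (x, null)).reduceByKey((x, y) => x).map(_._1)
  def distinct[T](xs: Seq[T]): Seq[T] = {
    val pairs = xs.map(x => (x, null))               // 1. use each element as its own key
    val reduced = reduceByKey(pairs)((x, y) => x)    // 2. collapse equal keys into one entry
    reduced.map(_._1).toSeq                          // 3. drop the placeholder value
  }

  def main(args: Array[String]): Unit = {
    // Result order is not guaranteed, so sort before printing.
    println(distinct(Seq(3, 1, 3, 2, 1)).sorted)
  }
}
```

The null value is only a placeholder: the deduplication comes entirely from grouping by key, which is why the reduce function can simply return its first argument.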

Finally, I want to say that I still think Mr. Xu is a bit more impressive.

Posted on 2017-07-18 15:22 by WuLei吴磊
