Spark Source Code Guide 4: A First Look at GraphX

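2014 was a very important year for Spark. It became an Apache Top-Level Project (TLP) and one of the most active projects at the ASF, with broad industry support: Spark 1.2, released in December 2014, contained more than 1,000 commits from 172 contributors. In this release GraphX graduated from alpha and shipped with a stable API, which means users no longer need to worry that existing code will have to change later because of API changes. There are other changes as well; for example, the comment on mapReduceTriplets reads: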

* This function is deprecated in 1.2.0 because of SPARK-3936. Use aggregateMessages instead.
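To make the suggested replacement concrete, here is a minimal sketch of aggregateMessages that counts incoming edges per vertex. The toy edge list, the object name AggregateMessagesSketch, and the local master setting in the usage note are assumptions made for illustration; only Graph.fromEdges, aggregateMessages, sendToDst, and TripletFields come from GraphX itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexId}

object AggregateMessagesSketch {
  // Counts incoming edges per vertex on a small illustrative graph.
  def inDegrees(sc: SparkContext): Map[VertexId, Int] = {
    // Hypothetical toy data: three edges, two of them pointing at vertex 2.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "a"),
      Edge(3L, 2L, "b"),
      Edge(2L, 3L, "c")))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(1), // sendMsg: each edge sends 1 to its destination
      _ + _,                   // mergeMsg: add up the messages arriving at a vertex
      TripletFields.None       // no vertex or edge attributes are read by sendMsg
    ).collect().toMap
  }
}
```

With a local context such as `new SparkContext(new SparkConf().setMaster("local[1]").setAppName("sketch"))`, calling `AggregateMessagesSketch.inDegrees(sc)` should return entries only for vertices 2 and 3, because a vertex that receives no message does not appear in the resulting VertexRDD.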

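GraphX is built on top of Spark and uses a vertex cut to distribute computation: edges are assigned to partitions, and a vertex may appear in more than one partition. The cut follows one of several partitioning strategies defined in PartitionStrategy: EdgePartition2D, EdgePartition1D, RandomVertexCut, and CanonicalRandomVertexCut.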
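The idea of a vertex cut can be illustrated without Spark. The sketch below is not GraphX's code: the object VertexCutSketch, the multiplier constant, and both strategy functions are assumptions that only mimic the general shape of a source-based and a pair-based strategy. It assigns each edge to a partition and then reports which partitions every vertex ends up replicated in.

```scala
object VertexCutSketch {
  type VertexId = Long

  // Assumption: a large odd multiplier used to spread neighbouring ids apart.
  private val mixer: Long = 1125899906842597L

  // Source-based strategy: all edges with the same source share a partition.
  def partitionBySource(src: VertexId, dst: VertexId, numParts: Int): Int =
    java.lang.Math.floorMod(src * mixer, numParts.toLong).toInt

  // Pair-based strategy: the partition depends on both endpoints.
  def partitionByPair(src: VertexId, dst: VertexId, numParts: Int): Int =
    java.lang.Math.floorMod((src, dst).hashCode(), numParts)

  // For every vertex, the set of partitions that hold at least one of its edges.
  def replicas(
      edges: Seq[(VertexId, VertexId)],
      numParts: Int,
      strategy: (VertexId, VertexId, Int) => Int): Map[VertexId, Set[Int]] =
    edges.flatMap { case (src, dst) =>
      val p = strategy(src, dst, numParts)
      Seq(src -> p, dst -> p) // both endpoints are present where the edge lives
    }.groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

Under the source-based function, a vertex that only has outgoing edges is never replicated, while its neighbours may be; this trade-off between replication and balance is what the choice of strategy controls.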

When we start a program in GraphX, the first step is to create a Graph object:

val graph: Graph[(String, String), String]

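Creating a Graph object actually calls the apply method of the Graph companion object, to which we pass the vertices, the edges, and a few other parameters.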

  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }

As the code above shows, the actual implementation of Graph is GraphImpl.
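As a usage sketch of this apply method, the following builds a Graph[(String, String), String] from two small RDDs. The vertex names, edge labels, default attribute, and the object name GraphApplySketch are all hypothetical; note that vertex 3 is referenced by an edge but has no entry in the vertex RDD, so it receives the default attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphApplySketch {
  // Builds a small property graph; all data below is illustrative.
  def build(sc: SparkContext): Graph[(String, String), String] = {
    // Vertex attribute: (name, role).
    val vertices: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (1L, ("alice", "student")),
      (2L, ("bob", "professor"))))

    // Edge attribute: a relationship label. Vertex 3 appears only here.
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "advisee"),
      Edge(2L, 3L, "colleague")))

    // Third argument is defaultVertexAttr, used for vertices missing from `vertices`.
    Graph(vertices, edges, ("unknown", "missing"))
  }
}
```

Leaving the two storage-level parameters at their defaults keeps both the edge and the vertex data at MEMORY_ONLY, as the signature above indicates.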

Next, look at the class Graph itself. It has three important members: vertices, edges, and triplets.

  /**
   * An RDD containing the vertices and their associated attributes.
   *
   * @note vertex ids are unique.
   * @return an RDD containing the vertices in this graph
   */
  @transient val vertices: VertexRDD[VD]

  /**
   * An RDD containing the edges and their associated attributes.  The entries in the RDD contain
   * just the source id and target id along with the edge data.
   *
   * @return an RDD containing the edges in this graph
   *
   * @see [[Edge]] for the edge type.
   * @see [[Graph#triplets]] to get an RDD which contains all the edges
   * along with their vertex data.
   *
   */
  @transient val edges: EdgeRDD[ED]

  /**
   * An RDD containing the edge triplets, which are edges along with the vertex data associated with
   * the adjacent vertices. The caller should use [[edges]] if the vertex data are not needed, i.e.
   * if only the edge data and adjacent vertex ids are needed.
   *
   * @return an RDD containing edge triplets
   *
   * @example This operation might be used to evaluate a graph
   * coloring where we would like to check that both vertices are a
   * different color.
   * {{{
   * type Color = Int
   * val graph: Graph[Color, Int] = GraphLoader.edgeListFile("hdfs://file.tsv")
   * val numInvalid = graph.triplets.map(e => if (e.src.data == e.dst.data) 1 else 0).sum
   * }}}
   */
  @transient val triplets: RDD[EdgeTriplet[VD, ED]]

vertices is the vertex RDD. Its class is VertexRDD, and each element holds a vertex ID and the vertex attribute.

abstract class VertexRDD[VD](
    @transient sc: SparkContext,
    @transient deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps)

edges is the edge RDD. Its class is EdgeRDD, and each element holds the source vertex ID, the destination vertex ID, and the edge attribute.

abstract class EdgeRDD[ED](
    @transient sc: SparkContext,
    @transient deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps)

case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
    var srcId: VertexId = 0,
    var dstId: VertexId = 0,
    var attr: ED = null.asInstanceOf[ED])
  extends Serializable

triplets combines all vertex and edge attributes. It is equivalent to joining vertices with edges.

class EdgeTriplet[VD, ED] extends Edge[ED]
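The join that triplets performs can be modelled with plain Scala collections. This is only a conceptual sketch, not how GraphX implements it: SimpleEdge, SimpleTriplet, and TripletJoinSketch are hypothetical names, and a Map lookup stands in for the distributed join. The field names srcAttr and dstAttr mirror the ones EdgeTriplet exposes.

```scala
object TripletJoinSketch {
  type VertexId = Long

  // Stand-in for an edge: two endpoint ids plus the edge attribute.
  final case class SimpleEdge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  // Stand-in for an edge triplet: the edge plus both endpoint attributes.
  final case class SimpleTriplet[VD, ED](
      srcId: VertexId, srcAttr: VD,
      dstId: VertexId, dstAttr: VD,
      attr: ED)

  // Attach the source and destination vertex attributes to every edge.
  // Assumes every endpoint id has an entry in `vertices`.
  def triplets[VD, ED](
      vertices: Map[VertexId, VD],
      edges: Seq[SimpleEdge[ED]]): Seq[SimpleTriplet[VD, ED]] =
    edges.map { e =>
      SimpleTriplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr)
    }
}
```

A check like the graph-coloring example in the scaladoc above then becomes a filter over the result that compares srcAttr with dstAttr.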

As with Graph, the VertexRDD and EdgeRDD classes above are actually implemented by VertexRDDImpl and EdgeRDDImpl, respectively.

Another important class is GraphOps. It holds methods that become available on every Graph object through an implicit conversion, for example pageRank.
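The mechanism behind this is an implicit conversion placed in a companion object. The sketch below reproduces the pattern with hypothetical stand-ins (TinyGraph, TinyGraphOps, toOps); it is not the GraphX source, only an illustration of how extra methods can appear on a class without being declared in it.

```scala
import scala.language.implicitConversions

object ImplicitOpsSketch {
  // The core class only stores data.
  class TinyGraph(val edges: Seq[(Long, Long)])

  // Convenience methods live in a separate wrapper class.
  class TinyGraphOps(g: TinyGraph) {
    def numEdges: Long = g.edges.size.toLong

    // Number of outgoing edges per source vertex.
    def outDegrees: Map[Long, Int] =
      g.edges.groupBy(_._1).map { case (v, es) => v -> es.size }
  }

  object TinyGraph {
    // Found automatically because it sits in TinyGraph's companion object.
    implicit def toOps(g: TinyGraph): TinyGraphOps = new TinyGraphOps(g)
  }
}
```

Calling `new ImplicitOpsSketch.TinyGraph(Seq((1L, 2L))).numEdges` compiles even though TinyGraph declares no such method, because the compiler inserts the conversion to TinyGraphOps. Keeping the core class small and the derived operations in a wrapper is the design choice this pattern supports.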

Finally, I use a diagram to show the relationships among the classes above.

Posted on 2015-03-12 12:36 by Ai_togic
