Akka Source Code Analysis: Akka Streams GraphStage

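In the previous post we looked at a small part of the ActorMaterializer source. That analysis was fairly shallow: it only gave a first glimpse of the Materializer's basic initialization and the concepts involved. We know that during materialization the Graph is traversed in some way, actors are created, and the graph finally starts running. But we have not yet studied the Graph-related concepts in any depth, and the definition of Graph is abstract enough to be hard to grasp at first sight. While reading the official documentation I came across the chapter on custom stream processing, which should help us understand Graph, so here is a brief analysis of it.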

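The GraphStage abstraction can be used to create arbitrary operators with any number of input and output ports. It is the counterpart of the GraphDSL.create() method, which creates new stream processing operators by composing others. What makes GraphStage different is that it creates an operator that cannot be divided into smaller ones and that can maintain internal state in a safe way. Sounds a lot like an actor, doesn't it? Indeed: a long time ago, the role of GraphStage was played by actors. How do I know? From reading the code, of course.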

@deprecated("Use `akka.stream.stage.GraphStage` instead, it allows for all operations an Actor would and is more type-safe as well as guaranteed to be ReactiveStreams compliant.", since = "2.5.0")
trait ActorSubscriber extends Actor
@deprecated("Use `akka.stream.stage.GraphStage` instead, it allows for all operations an Actor would and is more type-safe as well as guaranteed to be ReactiveStreams compliant.", since = "2.5.0")
trait ActorPublisher[T] extends Actor

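As the source above shows, ActorSubscriber and ActorPublisher were both deprecated in 2.5.0 in favor of GraphStage: what used to take two actor-based abstractions, one for the subscribing side and one for the publishing side, is now covered by GraphStage alone. This also means that a GraphStage can define output ports as well as input ports.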

/**
 * A GraphStage represents a reusable graph stream processing operator.
 *
 * A GraphStage consists of a [[Shape]] which describes its input and output ports and a factory function that
 * creates a [[GraphStageLogic]] which implements the processing logic that ties the ports together.
 */
abstract class GraphStage[S <: Shape] extends GraphStageWithMaterializedValue[S, NotUsed]

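According to the official comment, GraphStage represents a reusable stream processing operation in a graph (let's call it an operator). It consists of a Shape and a factory function: the Shape describes its input and output ports, and the factory function creates a GraphStageLogic, which implements the processing logic that ties the ports together.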

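Put simply, a GraphStage can be thought of as an operator, a single step of data processing, or even just a function in procedural programming: it has inputs, outputs, and the logic that operates on the data.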

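import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}

class NumbersSource extends GraphStage[SourceShape[Int]] {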
  val out: Outlet[Int] = Outlet("NumbersSource")
  override val shape: SourceShape[Int] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      // All state MUST be inside the GraphStageLogic,
      // never inside the enclosing GraphStage.
      // This state is safe to access and modify from all the
      // callbacks that are provided by GraphStageLogic and the
      // registered handlers.
      private var counter = 1

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          push(out, counter)
          counter += 1
        }
      })
    }
}

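NumbersSource is one of the official demos. This Source produces an increasing sequence starting from 1, and under backpressure it simply stops producing, because it emits only when it is pulled. The comments in the demo are clear as well: state kept inside GraphStageLogic is safe to access and modify, but only from the callbacks provided by GraphStageLogic and the registered handlers.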
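To make the point about state concrete, here is a minimal sketch that materializes the same NumbersSource twice. It assumes akka-stream 2.5.x on the classpath and repeats the stage so that it is self-contained; the object name NumbersSourceDemo and the timeouts are made up for illustration. Because createLogic is called once per materialization, each run should get its own counter.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, Outlet, SourceShape}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

// Same stage as the official demo, repeated so the sketch compiles on its own.
class NumbersSource extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("NumbersSource")
  override val shape: SourceShape[Int] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var counter = 1 // one counter per materialization

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          push(out, counter)
          counter += 1
        }
      })
    }
}

// Hypothetical driver: runs the same blueprint twice and returns both results.
object NumbersSourceDemo {
  def run(): (Seq[Int], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("numbers-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val numbers = Source.fromGraph(new NumbersSource)
      val first = Await.result(numbers.take(3).runWith(Sink.seq), 3.seconds)
      val second = Await.result(numbers.take(3).runWith(Sink.seq), 3.seconds)
      (first, second) // both runs start counting from 1 again
    } finally system.terminate()
  }
}
```

Had the counter been a field of NumbersSource itself, the two runs would have shared it, which is exactly what the comment in the demo warns against.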

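NumbersSource also overrides a shape field. Where does this field come from? Following GraphStage's inheritance chain, it ultimately extends the Graph trait, and that trait declares shape, which represents the "shape" of the current Graph. Its type is determined by GraphStage's type parameter, here SourceShape[Int]: a shape with only an output and no input, whose output elements are of type Int.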

Next, let's look at the definition of GraphStageLogic. This class is quite important, because it determines how the data is processed.

/**
 * Represents the processing logic behind a [[GraphStage]]. Roughly speaking, a subclass of [[GraphStageLogic]] is a
 * collection of the following parts:
 *  * A set of [[InHandler]] and [[OutHandler]] instances and their assignments to the [[Inlet]]s and [[Outlet]]s
 *    of the enclosing [[GraphStage]]
 *  * Possible mutable state, accessible from the [[InHandler]] and [[OutHandler]] callbacks, but not from anywhere
 *    else (as such access would not be thread-safe)
 *  * The lifecycle hooks [[preStart()]] and [[postStop()]]
 *  * Methods for performing stream processing actions, like pulling or pushing elements
 *
 * The operator logic is completed once all its input and output ports have been closed. This can be changed by
 * setting `setKeepGoing` to true.
 *
 * The `postStop` lifecycle hook on the logic itself is called once all ports are closed. This is the only tear down
 * callback that is guaranteed to happen, if the actor system or the materializer is terminated the handlers may never
 * see any callbacks to `onUpstreamFailure`, `onUpstreamFinish` or `onDownstreamFinish`. Therefore operator resource
 * cleanup should always be done in `postStop`.
 */
abstract class GraphStageLogic private[stream] (val inCount: Int, val outCount: Int)

GraphStageLogic defines the processing logic behind a GraphStage. Roughly speaking, a subclass of GraphStageLogic is a collection of the following parts:

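A set of InHandler and OutHandler instances, and their assignments to the Inlets and Outlets of the enclosing GraphStage.
Optional mutable state, accessible from the InHandler and OutHandler callbacks but from nowhere else (such access would not be thread-safe).
The lifecycle hooks preStart and postStop.
Methods for performing stream processing actions, such as pulling and pushing elements.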

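Once all input and output ports have been closed, the operator logic is completed (setKeepGoing can change this). At that point postStop is called. It is the only teardown callback that is guaranteed to run, so resource cleanup should always be done there.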
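As a sketch of cleanup in postStop, the pass-through stage below pretends to hold a resource and releases it in postStop. It assumes akka-stream 2.5.x; ResourceHoldingStage, PostStopDemo and the AtomicBoolean standing in for a real resource handle are hypothetical names used only for this illustration.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical pass-through stage; `released` stands in for a real resource handle.
class ResourceHoldingStage[A](released: AtomicBoolean) extends GraphStage[FlowShape[A, A]] {
  val in: Inlet[A] = Inlet("ResourceHoldingStage.in")
  val out: Outlet[A] = Outlet("ResourceHoldingStage.out")
  override val shape: FlowShape[A, A] = FlowShape(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = push(out, grab(in))
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })

      // Cleanup goes here rather than in onUpstreamFinish or onDownstreamFinish.
      override def postStop(): Unit = released.set(true)
    }
}

object PostStopDemo {
  def run(): Boolean = {
    implicit val system: ActorSystem = ActorSystem("poststop-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    val released = new AtomicBoolean(false)
    try Await.result(
      Source(1 to 3).via(new ResourceHoldingStage[Int](released)).runWith(Sink.ignore),
      3.seconds)
    finally Await.result(system.terminate(), 10.seconds)
    // Read the flag only after the actor system has stopped, so no stage is still running.
    released.get()
  }
}
```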

  final protected def setHandler(out: Outlet[_], handler: OutHandler): Unit = {
    handlers(out.id + inCount) = handler
    if (_interpreter != null) _interpreter.setHandler(conn(out), handler)
  }

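setHandler is simple too: it stores the OutHandler in the handlers array at index out.id + inCount. The _interpreter here is the interpreter that executes the graph, not something we set ourselves; since nothing has been materialized at this point, it should still be null.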
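The index arithmetic can be illustrated with a toy model that has nothing to do with Akka's real classes. It assumes, as the out.id + inCount offset suggests, that input handlers occupy the first inCount slots of a single array and output handlers follow them; HandlerTable and its methods are invented names, and plain strings stand in for handlers.

```scala
// Toy model, not Akka code: one array holds both kinds of handlers.
object HandlerTableDemo {
  final class HandlerTable(val inCount: Int, val outCount: Int) {
    private val handlers = new Array[String](inCount + outCount)

    // Assumption: input handlers are stored directly at the inlet id.
    def setInHandler(inId: Int, handler: String): Unit = {
      handlers(inId) = handler
    }

    // Output handlers are shifted past the input handlers, as in setHandler above.
    def setOutHandler(outId: Int, handler: String): Unit = {
      handlers(outId + inCount) = handler
    }

    def slots: List[String] = handlers.toList
  }

  def run(): List[String] = {
    val table = new HandlerTable(inCount = 2, outCount = 1)
    table.setInHandler(0, "in0")
    table.setInHandler(1, "in1")
    table.setOutHandler(0, "out0")
    table.slots // List("in0", "in1", "out0")
  }
}
```

A single array keeps the lookup to one index computation, which matters on a path that is executed for every element.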

/**
 * Collection of callbacks for an input port of a [[GraphStage]]
 */
trait InHandler {
  /**
   * Called when the input port has a new element available. The actual element can be retrieved via the
   * [[GraphStageLogic.grab()]] method.
   */
  @throws(classOf[Exception])
  def onPush(): Unit

  /**
   * Called when the input port is finished. After this callback no other callbacks will be called for this port.
   */
  @throws(classOf[Exception])
  def onUpstreamFinish(): Unit = GraphInterpreter.currentInterpreter.activeStage.completeStage()

  /**
   * Called when the input port has failed. After this callback no other callbacks will be called for this port.
   */
  @throws(classOf[Exception])
  def onUpstreamFailure(ex: Throwable): Unit = GraphInterpreter.currentInterpreter.activeStage.failStage(ex)
}

/**
 * Collection of callbacks for an output port of a [[GraphStage]]
 */
trait OutHandler {
  /**
   * Called when the output port has received a pull, and therefore ready to emit an element, i.e. [[GraphStageLogic.push()]]
   * is now allowed to be called on this port.
   */
  @throws(classOf[Exception])
  def onPull(): Unit

  /**
   * Called when the output port will no longer accept any new elements. After this callback no other callbacks will
   * be called for this port.
   */
  @throws(classOf[Exception])
  def onDownstreamFinish(): Unit = {
    GraphInterpreter
      .currentInterpreter
      .activeStage
      .completeStage()
  }
}

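Above are the definitions of InHandler and OutHandler. OutHandler defines an onPull callback which, according to the comment, is called only when the output port has received a pull request. Remember the design philosophy of Akka Streams? It is built on the Reactive Streams API and implements backpressure without having to buffer an unbounded amount of data. How? With pulls answered by pushes. In Reactive Streams terms, the downstream consumer signals how many elements it is ready to receive, the upstream sends at most that many, and the downstream asks for more only once it has capacity again: a lightly loaded consumer can request more at a time, a heavily loaded one less. Inside a GraphStage this takes its simplest form: each pull requests exactly one element and each push delivers exactly one. That is why pull and push exist.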
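The effect of pull-driven production can be shown with a toy model in plain Scala. This is not how Akka implements it; CountingUpstream and consume are hypothetical names. The only point is that the upstream produces exactly as many elements as the downstream asks for, no matter how many it could produce.

```scala
// Toy model of pull-driven production, not Akka code.
object PullDrivenDemo {
  // Upstream side: an element is created only inside onPull.
  final class CountingUpstream {
    private var counter = 1
    var produced = 0

    def onPull(): Int = {
      val elem = counter
      counter += 1
      produced += 1
      elem
    }
  }

  // Downstream side: issues one pull per element it is willing to handle.
  def consume(upstream: CountingUpstream, demand: Int): List[Int] =
    List.fill(demand)(upstream.onPull())

  def run(): (List[Int], Int) = {
    val upstream = new CountingUpstream
    val received = consume(upstream, demand = 3)
    (received, upstream.produced) // (List(1, 2, 3), 3): nothing was produced ahead of demand
  }
}
```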

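When the OutHandler of NumbersSource receives a pull request, it calls push to send the element to the out port and then increments the counter, which is how the increasing sequence is generated. But where is push implemented? OutHandler has no such method. If you are familiar with inner classes in Java you already know the answer: the handler is an anonymous class created inside the GraphStageLogic subclass, so it can call the methods of the enclosing GraphStageLogic, and that is where push is defined.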

  /**
   * Emits an element through the given output port. Calling this method twice before a [[pull()]] has been arrived
   * will fail. There can be only one outstanding push request at any given time. The method [[isAvailable()]] can be
   * used to check if the port is ready to be pushed or not.
   */
  final protected def push[T](out: Outlet[T], elem: T): Unit = {
    val connection = conn(out)
    val it = interpreter
    val portState = connection.portState

    connection.portState = portState ^ PushStartFlip

    if ((portState & (OutReady | OutClosed | InClosed)) == OutReady && (elem != null)) {
      connection.slot = elem
      it.chasePush(connection)
    } else {
      // Restore state for the error case
      connection.portState = portState

      // Detailed error information should not add overhead to the hot path
      ReactiveStreamsCompliance.requireNonNullElement(elem)
      if (isClosed(out)) throw new IllegalArgumentException(s"Cannot push closed port ($out)")
      if (!isAvailable(out)) throw new IllegalArgumentException(s"Cannot push port ($out) twice, or before it being pulled")

      // No error, just InClosed caused the actual pull to be ignored, but the status flag still needs to be flipped
      connection.portState = portState ^ PushStartFlip
    }
  }

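push emits an element through the given output port. Calling push again before the next pull has arrived fails; in other words, every push corresponds to exactly one pull. The logic is fairly clear: it obtains a connection, checks that the connection's state is OutReady (with neither side closed) and that the element is not null, stores the element in the connection's slot field, and hands the connection to the interpreter through chasePush.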
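The bit tricks in push are easier to follow with concrete numbers. The sketch below is a standalone model: the flag values, and the flip masks other than PushStartFlip (which do not appear in the excerpt), are assumptions chosen so that each XOR moves the port to the next state of a pull/push cycle. They are not taken from the Akka source.

```scala
// Standalone model of a port-state bitmask; all constant values are assumptions.
object PortStateDemo {
  val InReady = 1
  val Pulling = 2
  val Pushing = 4
  val OutReady = 8
  val InClosed = 16
  val OutClosed = 32

  // XOR-ing with a flip mask clears one flag and sets the next in a single operation.
  val PullStartFlip = InReady | Pulling  // InReady  -> Pulling
  val PullEndFlip = Pulling | OutReady   // Pulling  -> OutReady
  val PushStartFlip = OutReady | Pushing // OutReady -> Pushing
  val PushEndFlip = Pushing | InReady    // Pushing  -> InReady

  // Same shape as the check in push: OutReady must be set and neither side closed.
  def canPush(portState: Int): Boolean =
    (portState & (OutReady | OutClosed | InClosed)) == OutReady

  def run(): List[Int] = {
    val idle = InReady
    val pulling = idle ^ PullStartFlip
    val pulled = pulling ^ PullEndFlip
    val pushing = pulled ^ PushStartFlip
    val delivered = pushing ^ PushEndFlip
    List(idle, pulling, pulled, pushing, delivered) // List(1, 2, 8, 4, 1)
  }
}
```

In this model canPush holds only in the OutReady state, which is why a second push without an intervening pull is rejected.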

  // Using common array to reduce overhead for small port counts
  private[stream] val portToConn = new Array[Connection](handlers.length)

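Tracing further, we find that conn simply looks up an entry in the array above using the id of the Outlet. Unfortunately, we could not find where this array gets filled in. That is understandable: our graph has not been materialized yet, so it is normal that the relevant data is not there. We will come back to this point later.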

  /**
   * INERNAL API
   *
   * Contains all the necessary information for the GraphInterpreter to be able to implement a connection
   * between an output and input ports.
   *
   * @param id Identifier of the connection.
   * @param inOwner The operator logic that corresponds to the input side of the connection.
   * @param outOwner The operator logic that corresponds to the output side of the connection.
   * @param inHandler The handler that contains the callback for input events.
   * @param outHandler The handler that contains the callback for output events.
   */
  final class Connection(
    var id:         Int,
    var inOwner:    GraphStageLogic,
    var outOwner:   GraphStageLogic,
    var inHandler:  InHandler,
    var outHandler: OutHandler) {
    var portState: Int = InReady
    var slot: Any = Empty

    override def toString =
      if (GraphInterpreter.Debug) s"Connection($id, $inOwner, $outOwner, $inHandler, $outHandler, $portState, $slot)"
      else s"Connection($id, $portState, $slot, $inHandler, $outHandler)"
  }

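Connection can be thought of as a JavaBean that bundles the relevant parameters together. Yet push merely assigns the element to slot. Does that count as sending it? Once the element is in slot, when does the downstream take it?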

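import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler}

class StdoutSink extends GraphStage[SinkShape[Int]] {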
  val in: Inlet[Int] = Inlet("StdoutSink")
  override val shape: SinkShape[Int] = SinkShape(in)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {

      // This requests one element at the Sink startup.
      override def preStart(): Unit = pull(in)

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          println(grab(in))
          pull(in)
        }
      })
    }
}

The official documentation follows this with another class, a Sink. In the Sink's GraphStageLogic we can see that it calls grab to obtain the element.

 /**
   * Once the callback [[InHandler.onPush()]] for an input port has been invoked, the element that has been pushed
   * can be retrieved via this method. After [[grab()]] has been called the port is considered to be empty, and further
   * calls to [[grab()]] will fail until the port is pulled again and a new element is pushed as a response.
   *
   * The method [[isAvailable()]] can be used to query if the port has an element that can be grabbed or not.
   */
  final protected def grab[T](in: Inlet[T]): T = {
    val connection = conn(in)
    val it = interpreter
    val elem = connection.slot

    // Fast path
    if ((connection.portState & (InReady | InFailed)) == InReady && (elem.asInstanceOf[AnyRef] ne Empty)) {
      connection.slot = Empty
      elem.asInstanceOf[T]
    } else {
      // Slow path
      if (!isAvailable(in)) throw new IllegalArgumentException(s"Cannot get element from already empty input port ($in)")
      val failed = connection.slot.asInstanceOf[Failed]
      val elem = failed.previousElem.asInstanceOf[T]
      connection.slot = Failed(failed.ex, Empty)
      elem
    }
  }

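grab obtains the Connection through the Inlet, takes the value in the Connection's slot as its return value, and resets the slot to Empty so that the same element cannot be grabbed twice.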
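The hand-off through slot can be mimicked in a few lines of plain Scala. This is a deliberately simplified model and not the Akka implementation: ToyConnection, push and grab are hypothetical stand-ins, and the port-state checks are replaced by a simple test on the slot.

```scala
// Simplified model of the slot hand-off, not Akka code.
object SlotHandOffDemo {
  case object Empty // sentinel meaning "no element in the slot"

  final class ToyConnection {
    var slot: Any = Empty
  }

  // Upstream side: store the element in the shared slot.
  def push(connection: ToyConnection, elem: Any): Unit = {
    require(connection.slot == Empty, "slot already holds an element")
    connection.slot = elem
  }

  // Downstream side: take the element and mark the slot as empty again.
  def grab[T](connection: ToyConnection): T = {
    require(connection.slot != Empty, "nothing to grab")
    val elem = connection.slot
    connection.slot = Empty
    elem.asInstanceOf[T]
  }

  def run(): (Int, Boolean) = {
    val connection = new ToyConnection
    push(connection, 42)
    val elem = grab[Int](connection)
    (elem, connection.slot == Empty) // (42, true)
  }
}
```

Calling grab a second time without a new push fails the require check, mirroring the "already empty input port" error in the real method.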

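Now we can roughly sketch how GraphStage processes data. GraphStages pass data through a Connection used as a kind of "global variable": the source stores the element to be sent in the slot field of some Connection, and the sink reads it from the slot field of the same Connection. So how are the Source and the Sink bound together? That is the job of ActorMaterializer: only after materialization are the Source and the Sink bound through a Connection. My guess is that the binding is based on the ids of the input and output ports, so that an input port and an output port that are wired together end up sharing one Connection and can pass data through it. This is a bit convoluted, and whether it really works this way needs further analysis.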
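From the user's side, the binding happens when the blueprint is run. The sketch below wires NumbersSource to a custom sink and materializes the result. It assumes akka-stream 2.5.x; CollectingSink is a hypothetical variant of StdoutSink that reports what it grabbed through a Promise so the outcome can be inspected, which also makes that particular instance single-use.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, Inlet, Outlet, SinkShape, SourceShape}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Same stage as the official demo, repeated so the sketch compiles on its own.
class NumbersSource extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("NumbersSource")
  override val shape: SourceShape[Int] = SourceShape(out)
  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var counter = 1
      setHandler(out, new OutHandler {
        override def onPull(): Unit = { push(out, counter); counter += 1 }
      })
    }
}

// Hypothetical variant of StdoutSink: collects elements instead of printing them.
class CollectingSink(result: Promise[Vector[Int]]) extends GraphStage[SinkShape[Int]] {
  val in: Inlet[Int] = Inlet("CollectingSink")
  override val shape: SinkShape[Int] = SinkShape(in)
  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var seen = Vector.empty[Int]
      override def preStart(): Unit = pull(in) // request the first element
      setHandler(in, new InHandler {
        override def onPush(): Unit = { seen :+= grab(in); pull(in) }
        override def onUpstreamFinish(): Unit = { result.trySuccess(seen); completeStage() }
        override def onUpstreamFailure(ex: Throwable): Unit = { result.tryFailure(ex); failStage(ex) }
      })
    }
}

object WiringDemo {
  def run(): Vector[Int] = {
    implicit val system: ActorSystem = ActorSystem("wiring-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val result = Promise[Vector[Int]]()
      // Source and sink only get connected when the blueprint is materialized here.
      Source.fromGraph(new NumbersSource).take(3).runWith(Sink.fromGraph(new CollectingSink(result)))
      Await.result(result.future, 3.seconds)
    } finally system.terminate()
  }
}
```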

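At this point the role of GraphStage is clear: it defines the operators of stream processing. A GraphStage can be understood as a function: Shape defines the types of its inputs and outputs, GraphStageLogic defines the function body, and Connection.slot carries its return value to other functions. A Graph can then be understood as a chain of function calls, except that the call structure is more complex than a straight line and may be a DAG.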

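To interact with an operator's ports (Inlet and Outlet), we need to be able to receive and generate the events that belong to them. From GraphStageLogic, the following operations are available on an output port: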

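push(out, elem). Pushes an element to the output port, provided the downstream has sent a pull request.
complete(out). Closes the output port normally.
fail(out, exception). Closes the output port with the given failure.
isAvailable(out). Checks whether the port can currently be pushed to.
isClosed(out). Checks whether the port has been closed. A closed port can be neither pushed to nor pulled.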

Events related to an output port are received in an OutHandler instance.
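As an illustration of complete(out), here is a sketch of a finite source that counts down and then closes its output port. It assumes akka-stream 2.5.x; CountdownSource and CountdownDemo are names invented for this example.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, Outlet, SourceShape}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical finite source: emits from, from - 1, ..., 1 and then completes.
class CountdownSource(from: Int) extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("CountdownSource")
  override val shape: SourceShape[Int] = SourceShape(out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var remaining = from

      setHandler(out, new OutHandler {
        override def onPull(): Unit =
          if (remaining > 0) {
            push(out, remaining) // answer this pull with exactly one element
            remaining -= 1
          } else complete(out)   // nothing left: close the port normally
      })
    }
}

object CountdownDemo {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("countdown-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(Source.fromGraph(new CountdownSource(3)).runWith(Sink.seq), 3.seconds)
    finally system.terminate()
  }
}
```

Since the source has only this one port, completing it also completes the whole stage.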

The operations available on an input port are:

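pull(in). Requests one element from the input port. This is only allowed when no earlier pull is still outstanding, that is, after upstream has answered the previous pull with a push.
grab(in). Retrieves the element during the onPush callback. It cannot be called twice for the same element.
cancel(in). Closes the input port.
isAvailable(in). Checks whether an element can currently be grabbed from the port.
hasBeenPulled(in). Checks whether the port has already been pulled. In this state pull cannot be called again.
isClosed(in). Checks whether the port has been closed.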
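The interplay of grab and pull shows up nicely in a filter: when an element is dropped, the stage still owes the downstream an answer to its pull, so it pulls again itself. The sketch below assumes akka-stream 2.5.x and follows the pattern of the filter example in the official documentation; FilterStage and FilterDemo are names chosen here.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

class FilterStage[A](p: A => Boolean) extends GraphStage[FlowShape[A, A]] {
  val in: Inlet[A] = Inlet("FilterStage.in")
  val out: Outlet[A] = Outlet("FilterStage.out")
  override val shape: FlowShape[A, A] = FlowShape(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val elem = grab(in)             // take the element exactly once
          if (p(elem)) push(out, elem)    // keep it: answer the pending pull
          else pull(in)                   // drop it: ask upstream for another one
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in) // forward demand upstream
      })
    }
}

object FilterDemo {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("filter-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(
      Source(1 to 6).via(new FilterStage[Int](_ % 2 == 0)).runWith(Sink.seq),
      3.seconds)
    finally system.terminate()
  }
}
```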

Finally, there are two operations that act on all input and output ports at once:

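completeStage(). Equivalent to closing all output ports and cancelling all input ports.
failStage(exception). Equivalent to failing all output ports with the given exception and cancelling all input ports.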
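A small sketch of both calls, assuming akka-stream 2.5.x: the hypothetical GuardStage below passes positive numbers through, stops the stream normally when it sees 0, and fails it on a negative number. The rules are invented purely to exercise completeStage and failStage.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical stage: 0 ends the stream normally, a negative number fails it.
class GuardStage extends GraphStage[FlowShape[Int, Int]] {
  val in: Inlet[Int] = Inlet("GuardStage.in")
  val out: Outlet[Int] = Outlet("GuardStage.out")
  override val shape: FlowShape[Int, Int] = FlowShape(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val elem = grab(in)
          if (elem < 0) failStage(new IllegalArgumentException(s"negative element: $elem"))
          else if (elem == 0) completeStage() // completes out and cancels in
          else push(out, elem)
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}

object GuardDemo {
  // Returns the elements seen before the 0, and whether the second stream failed.
  def run(): (Seq[Int], Boolean) = {
    implicit val system: ActorSystem = ActorSystem("guard-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val completed = Await.result(
        Source(List(1, 2, 0, 3)).via(new GuardStage).runWith(Sink.seq), 3.seconds)
      val failed = Try(Await.result(
        Source(List(1, -1, 2)).via(new GuardStage).runWith(Sink.seq), 3.seconds)).isFailure
      (completed, failed)
    } finally system.terminate()
  }
}
```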

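import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

class Map[A, B](f: A => B) extends GraphStage[FlowShape[A, B]] {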

  val in = Inlet[A]("Map.in")
  val out = Outlet[B]("Map.out")

  override val shape = FlowShape.of(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          push(out, f(grab(in)))
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })
    }
}

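Above is a slightly more complex demo from the official documentation. It implements map: it applies the given function f to each element flowing into the stage and pushes the result downstream. Note that both an InHandler and an OutHandler are set here: onPull forwards the downstream demand upstream via pull(in), and onPush transforms the element and pushes it on.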
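A stage like this is plugged into a stream with via. The sketch below assumes akka-stream 2.5.x and repeats the stage under the name MapStage, a renaming made here only to avoid shadowing Scala's own Map; MapDemo is likewise a made-up name.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

// The map stage from the demo, renamed so it does not clash with scala's Map.
class MapStage[A, B](f: A => B) extends GraphStage[FlowShape[A, B]] {
  val in: Inlet[A] = Inlet("MapStage.in")
  val out: Outlet[B] = Outlet("MapStage.out")
  override val shape: FlowShape[A, B] = FlowShape(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = push(out, f(grab(in)))
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}

object MapDemo {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("map-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(
      Source(1 to 3).via(new MapStage[Int, String](n => s"#$n")).runWith(Sink.seq),
      3.seconds)
    finally system.terminate()
  }
}
```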

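Due to limited time, the analysis of GraphStage stops here. As we have seen, GraphStage ultimately carries the definition of operators as well as the linking of the graph, which makes it a very important concept. We are still some way from fully understanding how the concepts in Akka Streams relate to each other, so let's keep going.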

Reference: Custom stream processing (official Akka Streams documentation)

posted @ 2018-08-29 14:33 gabry.wu
