Akka Streams, Lesson 4: Buffers and Rate, Context Propagation

Buffers and working with rate

Dependency

To use Akka Streams, add the module to your project:

val AkkaVersion = "2.6.9"
libraryDependencies += "com.typesafe.akka" %% "akka-stream" % AkkaVersion

 

Introduction

When upstream and downstream rates differ, especially when throughput has spikes, it can be useful to introduce buffers in a stream. This chapter covers how buffers are used in Akka Streams.

Buffers for asynchronous operators

This section discusses the internal buffers that are introduced as an optimization when using asynchronous operators.

To run an operator asynchronously, it has to be marked explicitly as such using the `.async` method. Running asynchronously means that an operator, after handing an element to its downstream consumer, is able to immediately process the next element. To demonstrate this, let's take a look at the following example:

Source(1 to 3)
  .map { i =>
    println(s"A: $i"); i
  }
  .async
  .map { i =>
    println(s"B: $i"); i
  }
  .async
  .map { i =>
    println(s"C: $i"); i
  }
  .async
  .runWith(Sink.ignore)

 

Running the above example, one possible output looks like this:

A: 1
A: 2
B: 1
A: 3
B: 2
C: 1
B: 3
C: 2
C: 3

 

Note that the order is not A:1, B:1, C:1, A:2, B:2, C:2, which would correspond to the normal fused, synchronous execution model of flows, where an element passes completely through the processing pipeline before the next element enters the flow. Instead, an asynchronous operator processes the next element as soon as it has emitted the previous one.
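
For contrast, here is a minimal sketch of the same pipeline without the `.async` markers, so that all three stages stay fused. The object name, the actor system name, and the log helper are assumptions made for illustration; only `Source`, `map`, and `Sink.ignore` come from the example above.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ Sink, Source }
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.Await
import scala.concurrent.duration._

object FusedOrderExample {
  // Runs the three map stages fused (no .async) and returns the log lines
  // in the order in which the stages actually executed.
  def run(): Vector[String] = {
    implicit val system: ActorSystem = ActorSystem("fused-order") // hypothetical name
    val log = new AtomicReference(Vector.empty[String])

    // Builds a pass-through function that records "<name>: <element>".
    def stage(name: String): Int => Int = { i =>
      log.updateAndGet(_ :+ s"$name: $i")
      i
    }

    val done = Source(1 to 3)
      .map(stage("A"))
      .map(stage("B"))
      .map(stage("C"))
      .runWith(Sink.ignore)

    try Await.result(done, 5.seconds)
    finally system.terminate()
    log.get()
  }
}
```

With fused execution, each element should travel through A, B, and C before the next one enters, so the returned log is expected to read A: 1, B: 1, C: 1, A: 2, and so on.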

While pipelining in general increases throughput, in practice passing an element through the asynchronous (and therefore thread-crossing) boundary has a significant cost. To amortize this cost, Akka Streams uses a windowed, batching backpressure strategy internally. It is windowed because, as opposed to a Stop-And-Wait protocol, multiple elements might be “in-flight” concurrently with requests for elements. It is also batching because a new element is not requested immediately once an element has been drained from the window buffer; instead, multiple elements are requested after multiple elements have been drained. This batching strategy reduces the communication cost of propagating the backpressure signal through the asynchronous boundary.

While this internal protocol is mostly invisible to the user (apart from its throughput-increasing effects), there are situations where these details get exposed. All of the previous examples assumed that the rate of the processing chain is strictly coordinated through the backpressure signal, causing all operators to process no faster than the throughput of the connected chain. There are tools in Akka Streams, however, that enable the rates of different segments of a processing chain to be “detached”, or that define the maximum throughput of the stream through external timing sources. These are exactly the situations in which the internal batching buffering strategy suddenly becomes non-transparent.

Internal buffers and their effect

As explained above, for performance reasons Akka Streams introduces a buffer for every asynchronous operator. The purpose of these buffers is solely optimization; in fact, a size of 1 would be the most natural choice if there were no need for throughput improvements. It is therefore recommended to keep these buffer sizes small and to increase them only to a level suitable for the throughput requirements of the application. Default buffer sizes can be set through configuration:

akka.stream.materializer.max-input-buffer-size = 16

 

Alternatively, they can be set per stream by adding an attribute to the complete RunnableGraph. For smaller segments of the stream, define a separate `Flow` with these attributes:

val section = Flow[Int].map(_ * 2).async.addAttributes(Attributes.inputBuffer(initial = 1, max = 1)) // the buffer size of this map is 1
val flow = section.via(Flow[Int].map(_ / 2)).async // the buffer size of this map is the default
val runnableGraph =
  Source(1 to 10).via(flow).to(Sink.foreach(elem => println(elem)))

val withOverriddenDefaults = runnableGraph.withAttributes(Attributes.inputBuffer(initial = 64, max = 64))

 

Here is an example that demonstrates some of the issues caused by internal buffers:

import scala.concurrent.duration._
case class Tick()

RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  // this is the asynchronous stage in this graph
  val zipper = b.add(ZipWith[Tick, Int, Int]((tick, count) => count).async)

  Source.tick(initialDelay = 3.second, interval = 3.second, Tick()) ~> zipper.in0

  Source
    .tick(initialDelay = 1.second, interval = 1.second, "message!")
    .conflateWithSeed(seed = (_) => 1)((count, _) => count + 1) ~> zipper.in1

  zipper.out ~> Sink.foreach(println)
  ClosedShape
})

 

Running the above example, one would expect the number 3 to be printed every 3 seconds (the conflateWithSeed step here is configured so that it counts the number of elements received before the downstream `ZipWith` consumes them). What is printed is different, though: we will see the number 1. The reason is the internal buffer, which by default holds 16 elements and prefetches elements before the `ZipWith` starts consuming them. It is possible to fix this issue by changing the buffer size of `ZipWith` to 1. We will still see a leading 1, though, which is caused by the initial prefetch performed by the `ZipWith` operator.
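
One way to apply that fix is sketched below: the same graph as above, with an input buffer attribute of size 1 added to the asynchronous `ZipWith`. The wrapper object and value names are assumptions for illustration, and the graph is only built here, not run.

```scala
import akka.NotUsed
import akka.stream.{ Attributes, ClosedShape }
import akka.stream.scaladsl.{ GraphDSL, RunnableGraph, Sink, Source, ZipWith }
import scala.concurrent.duration._

object ZipWithSmallBuffer {
  case class Tick()

  // Blueprint only: materialize it with graph.run() and an implicit ActorSystem in scope.
  val graph: RunnableGraph[NotUsed] =
    RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
      import GraphDSL.Implicits._

      // Same asynchronous ZipWith as before, but its input buffer is limited
      // to a single element so it cannot prefetch a backlog of stale counts.
      val zipper = b.add(
        ZipWith[Tick, Int, Int]((_, count) => count).async
          .addAttributes(Attributes.inputBuffer(initial = 1, max = 1)))

      Source.tick(initialDelay = 3.seconds, interval = 3.seconds, Tick()) ~> zipper.in0

      Source
        .tick(initialDelay = 1.second, interval = 1.second, "message!")
        .conflateWithSeed(seed = _ => 1)((count, _) => count + 1) ~> zipper.in1

      zipper.out ~> Sink.foreach(println)
      ClosedShape
    })
}
```

Because both sources are timer driven, the exact numbers printed depend on timing; the sketch only shows where the attribute goes.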

Note

In general, when time or rate driven operators exhibit strange behavior, one of the first solutions to try should be to decrease the input buffer of the affected operators to 1.

Buffers in Akka Streams

This section discusses explicit, user-defined buffers that are part of the domain logic of an application's stream processing pipeline.

The example below ensures that 1000 jobs (but not more) are dequeued from an external (imaginary) system and stored locally in memory, relieving the external system:

// Getting a stream of jobs from an imaginary external system as a Source
val jobs: Source[Job, NotUsed] = inboundJobsConnector()
jobs.buffer(1000, OverflowStrategy.backpressure)

 

The next example also queues up 1000 jobs locally, but if there are more jobs waiting in the imaginary external system, it makes space for the new element by dropping one element from the tail of the buffer. Dropping from the tail is a very common strategy, but note that it drops the youngest waiting job. If some “fairness” is desired, in the sense that we want to be nice to jobs that have been waiting for a long time, then this option can be useful.

jobs.buffer(1000, OverflowStrategy.dropTail)

 

Instead of dropping the youngest element from the tail of the buffer, the new element can be dropped without enqueueing it to the buffer at all.

jobs.buffer(1000, OverflowStrategy.dropNew)

 

Here is another example with a queue of 1000 jobs, but it makes space for the new element by dropping one element from the head of the buffer, which is the oldest waiting job. This is the preferred strategy if jobs are expected to be resent when they are not processed within a certain period. The oldest element will be retransmitted soon (in fact, a retransmitted duplicate might already be in the queue!), so it makes sense to drop it first.

jobs.buffer(1000, OverflowStrategy.dropHead)
 

Compared to the dropping strategies above, dropBuffer drops all of the 1000 jobs it has enqueued once the buffer gets full. This aggressive strategy is useful when dropping jobs is preferred to delaying jobs.

jobs.buffer(1000, OverflowStrategy.dropBuffer)

If our imaginary external job provider is a client using our API, we might want to enforce that the client cannot have more than 1000 queued jobs; otherwise we consider it flooding and terminate the connection. This is achievable with the fail strategy, which fails the stream once the buffer gets full.

jobs.buffer(1000, OverflowStrategy.fail)
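
To see the fail strategy in action without an external system, here is a minimal sketch that overflows a deliberately tiny buffer. The buffer size of 2, the throttled consumer, and all names are assumptions chosen to trigger the overflow quickly; the failure surfaces as an `akka.stream.BufferOverflowException` on the materialized future.

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{ Sink, Source }
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

object BufferFailExample {
  // A fast producer feeds a 2-element buffer that is drained by a slow,
  // throttled consumer, so the buffer fills up and the stream fails.
  def run(): Try[Done] = {
    implicit val system: ActorSystem = ActorSystem("buffer-fail") // hypothetical name
    val done = Source(1 to 1000)
      .buffer(2, OverflowStrategy.fail)
      .throttle(1, 1.second) // stands in for a slow downstream
      .runWith(Sink.ignore)

    try Try(Await.result(done, 10.seconds))
    finally system.terminate()
  }
}
```

A caller can pattern match on the returned `Try` to decide how to react, for example by closing the offending client's connection.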

 

Rate transformation

Understanding conflate

 

When a fast producer cannot be told to slow down by backpressure or some other signal, conflate can be used to combine elements from the producer until a demand signal comes from the consumer.

Below is an example snippet that summarizes a fast stream of elements into the standard deviation, mean, and count of the elements that arrived while the stats were being calculated.

import akka.stream.scaladsl.Flow
import scala.collection.immutable
import scala.math.{ pow, sqrt }

// Collect every element that arrives while downstream is busy, then summarize the batch.
val statsFlow = Flow[Double].conflateWithSeed(immutable.Seq(_))(_ :+ _).map { s =>
  val μ = s.sum / s.size             // mean
  val se = s.map(x => pow(x - μ, 2)) // squared deviations from the mean
  val σ = sqrt(se.sum / se.size)     // population standard deviation
  (σ, μ, s.size)
}

 

This example demonstrates that the rate of such a flow is decoupled: the element rate at the start of the flow can be much higher than the element rate at the end of the flow.
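
The decoupling can be observed with a simpler aggregate. The sketch below, which assumes a running sum instead of statistics and a throttled consumer standing in for a slow downstream, conflates 100 elements into far fewer emitted values. How many values come out depends on timing, but since conflate only combines elements and never drops them, their total stays the same.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ Sink, Source }
import scala.concurrent.Await
import scala.concurrent.duration._

object ConflateRateExample {
  // Returns the partial sums that reached the slow consumer.
  def run(): Seq[Long] = {
    implicit val system: ActorSystem = ActorSystem("conflate-rate") // hypothetical name
    val partialSums = Source(1 to 100)
      .conflateWithSeed(_.toLong)(_ + _) // add up everything that arrives while downstream is busy
      .throttle(1, 100.millis)           // slow consumer: one element per 100 ms
      .runWith(Sink.seq)

    try Await.result(partialSums, 10.seconds)
    finally system.terminate()
  }
}
```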

 

Another possible use of conflate is to leave some elements out of the summary when the producer becomes too fast. The example below demonstrates how conflate can be used to randomly drop elements when the consumer is not able to keep up with the producer.

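import akka.stream.scaladsl.Flow
import scala.collection.immutable
import scala.util.Random

val p = 0.01 // probability of keeping an element that arrives while downstream is busy
val sampleFlow = Flow[Double]
  .conflateWithSeed(immutable.Seq(_)) {
    case (acc, elem) if Random.nextDouble() < p => acc :+ elem
    case (acc, _)                               => acc
  }
  .mapConcat(identity)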

 

See also conflate and conflateWithSeed for more information and examples.

 

Understanding extrapolate and expand

This section discusses two operators, extrapolate and expand, that help to deal with slow producers that are unable to keep up with the demand coming from consumers. They allow additional values to be sent as elements to a consumer.

As a simple use case of extrapolate, here is a flow that repeats the last emitted element to a consumer whenever the consumer signals demand and the producer cannot supply new elements yet.

val lastFlow = Flow[Double].extrapolate(Iterator.continually(_))

 

For situations where there may be downstream demand before any element is emitted from upstream, you can use the initial parameter of extrapolate to “seed” the stream.

val initial = 2.0
val seedFlow = Flow[Double].extrapolate(Iterator.continually(_), Some(initial))
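
The effect of the seed can be checked with an upstream that never emits. The sketch below uses `Source.maybe` (left uncompleted) as such an upstream, which is an assumption made for illustration; every element the consumer receives then comes from the seed and its extrapolation.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ Flow, Sink, Source }
import scala.concurrent.Await
import scala.concurrent.duration._

object ExtrapolateSeedExample {
  // Takes three elements from a seeded extrapolate flow whose upstream stays silent.
  def run(): Seq[Double] = {
    implicit val system: ActorSystem = ActorSystem("extrapolate-seed") // hypothetical name
    val initial = 2.0
    val seedFlow = Flow[Double].extrapolate(Iterator.continually(_), Some(initial))

    val firstThree = Source
      .maybe[Double] // never emits because its promise is never completed
      .via(seedFlow)
      .take(3)
      .runWith(Sink.seq)

    try Await.result(firstThree, 5.seconds)
    finally system.terminate()
  }
}
```

Here the result should be three copies of the seed value 2.0.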

 

extrapolate and expand also make it possible to produce meta-information based on the demand signalled from downstream. Leveraging this, here is a flow that tracks and reports the drift between a fast consumer and a slow producer.

val driftFlow = Flow[Double].map(_ -> 0).extrapolate[(Double, Int)] { case (i, _) => Iterator.from(1).map(i -> _) }

 

And here is a more concise representation with expand.

val driftFlow = Flow[Double].expand(i => Iterator.from(0).map(i -> _))

 

The difference is due to the different handling of the Iterator-generating argument.

While extrapolate uses an Iterator only when there is unmet downstream demand, expand always creates an Iterator and emits elements downstream from it.

This makes expand able to transform or even filter out (by providing an empty Iterator) the “original” elements.

Regardless, since we provide a non-empty Iterator in both examples, the output of this flow reports a drift of zero if the producer is fast enough, and a larger drift otherwise.

See also extrapolate and expand for more information and examples.

 

Context Propagation

It can be convenient to attach metadata to each element in the stream.

For example, when reading from an external data source it can be useful to keep track of the read offset, so it can be marked as processed when the element reaches the Sink.

For this use case we provide the SourceWithContext and FlowWithContext variations on Source and Flow.

Essentially, a FlowWithContext is just a Flow that contains tuples of element and context, but the advantage is in the operators: most operators on FlowWithContext work on the element rather than on the tuple, allowing you to focus on your application logic without worrying about the context.

Restrictions

Not all operations that are available on Flow are also available on FlowWithContext. This is intentional: in the use case of keeping track of a read offset, if the FlowWithContext were allowed to arbitrarily filter and reorder the stream, the Sink would have no way to determine whether an element was skipped or merely reordered and still in flight.

 

For this reason, FlowWithContext allows filtering operations (such as filter, filterNot, collect, etc.) and grouping operations (such as grouped, sliding, etc.) but not reordering operations (such as mapAsyncUnordered and statefulMapConcat). Finally, ‘one-to-n’ operations such as mapConcat are also allowed.

Filtering operations drop the context along with the dropped elements, while grouping operations keep all contexts from the elements in the group. Streaming one-to-many operations such as mapConcat associate the original context with each of the produced elements.
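
These rules can be observed on a small example. The sketch below assumes lines of text paired with an integer offset as context (the data and names are made up for illustration) and uses `asSourceWithContext` to turn a plain Source into a SourceWithContext, then `asSource` to expose the resulting tuples.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ Sink, Source }
import scala.concurrent.Await
import scala.concurrent.duration._

object ContextOpsExample {
  // Returns (words with their contexts, groups of two words with their contexts).
  def run(): (Seq[(String, Int)], Seq[(Seq[String], Seq[Int])]) = {
    implicit val system: ActorSystem = ActorSystem("context-ops") // hypothetical name

    // Each line carries an offset; the offset becomes the context.
    def words =
      Source(List(("a b", 1), ("c", 2)))
        .asSourceWithContext(_._2)       // context = offset
        .map(_._1)                       // element = line text
        .mapConcat(_.split(" ").toList)  // one line to many words, context is repeated

    val single = words.asSource.runWith(Sink.seq)
    val grouped = words.grouped(2).asSource.runWith(Sink.seq) // contexts are collected per group

    try (Await.result(single, 5.seconds), Await.result(grouped, 5.seconds))
    finally system.terminate()
  }
}
```

Both words produced from the first line should carry offset 1, and each group should carry the sequence of contexts of its members.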

As an escape hatch, there is a via operator that allows you to insert an arbitrary Flow that can process the tuples of elements and context in any way desired. When using this operator, it is the responsibility of the implementor to make sure this Flow does not perform any operations (such as reordering) that might break assumptions made by the Sink consuming the context elements.

Creation

The simplest way to create a SourceWithContext is to first create a regular Source with elements from which the context can be extracted, and then use Source.asSourceWithContext.
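
As a minimal sketch of this, assume a hypothetical `Record` type that carries its own read offset. The offset is extracted as the context, and the subsequent operators only see the element.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ Sink, Source }
import scala.concurrent.Await
import scala.concurrent.duration._

object SourceWithContextExample {
  // Hypothetical record read from an external data source.
  final case class Record(offset: Long, payload: String)

  def run(): Seq[(String, Long)] = {
    implicit val system: ActorSystem = ActorSystem("source-with-context") // hypothetical name

    val result = Source(List(Record(0L, "a"), Record(1L, "b"), Record(2L, "c")))
      .asSourceWithContext(_.offset) // the offset travels along as context
      .map(_.payload.toUpperCase)    // operates on the element only
      .filter(_ != "B")              // dropping an element drops its context too
      .asSource                      // back to a plain Source of (element, context) tuples
      .runWith(Sink.seq)

    try Await.result(result, 5.seconds)
    finally system.terminate()
  }
}
```

A Sink that commits offsets would consume the second component of each tuple.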

Composition

When you have a SourceWithContext source that produces elements of type Foo with a context of type Ctx, and a Flow flow from Foo to Bar, you cannot simply call source.via(flow) to arrive at a SourceWithContext that produces elements of type Bar with contexts of type Ctx. The reason for this is that flow might reorder the elements flowing through it, making via challenging to implement.

 

There is a Flow.asFlowWithContext which can be used when the types used in the inner Flow have room to hold the context. If this is not the case, a better solution is usually to build the flow from the ground up as a FlowWithContext, instead of first building a Flow and trying to convert it to a FlowWithContext after the fact.
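
Building the flow from the ground up could look like the sketch below. `Foo`, `Bar`, and `Ctx` are placeholder types matching the names used in the text, and the conversion inside `map` is made up for illustration.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{ FlowWithContext, Sink, Source, SourceWithContext }
import scala.concurrent.Await
import scala.concurrent.duration._

object ContextCompositionExample {
  // Placeholder types standing in for Foo, Bar, and Ctx.
  final case class Foo(n: Int)
  final case class Bar(text: String)
  type Ctx = Long

  // Defined directly as a FlowWithContext, so the context passes through untouched.
  val fooToBar: FlowWithContext[Foo, Ctx, Bar, Ctx, NotUsed] =
    FlowWithContext[Foo, Ctx].map(foo => Bar(s"bar-${foo.n}"))

  def run(): Seq[(Bar, Ctx)] = {
    implicit val system: ActorSystem = ActorSystem("context-composition") // hypothetical name

    val source: SourceWithContext[Foo, Ctx, NotUsed] =
      Source(List(Foo(1), Foo(2))).asSourceWithContext(foo => foo.n * 10L)

    // via accepts the FlowWithContext because it handles (element, context) tuples.
    val result = source.via(fooToBar).asSource.runWith(Sink.seq)

    try Await.result(result, 5.seconds)
    finally system.terminate()
  }
}
```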

 

posted @ 2020-09-18 14:34