Flink (7): Windows

Window Concepts

Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations. This document focuses on how windowing is performed in Flink and how the programmer can benefit to the maximum from its offered functionality.

In short: windows are the core of processing unbounded streams. They cut an infinite stream into finite pieces by distributing its elements into buckets of finite size, which can then be analyzed.

Window Lifecycle

In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows (see Window Assigners). For example, with an event-time-based windowing strategy that creates non-overlapping (or tumbling) windows every 5 minutes and has an allowed lateness of 1 min, Flink will create a new window for the interval between 12:00 and 12:05 when the first element with a timestamp that falls into this interval arrives, and it will remove it when the watermark passes the 12:06 timestamp.

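In short: a window is created as soon as the first element belonging to it arrives, and it is completely removed once the time (event time or processing time) passes the window's end timestamp plus the allowed lateness. Flink guarantees removal only for time-based windows, not for other window types such as global windows.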
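The 12:00 to 12:05 example can be checked with a few lines of arithmetic. This is a minimal sketch in plain Scala, not Flink's internal logic: the names `WindowLifecycleSketch`, `at`, `cleanupTime` and `isRemovable` are hypothetical, and treating "passes" as "greater than or equal to" is an assumption.

```scala
object WindowLifecycleSketch {
  val Minute: Long = 60 * 1000L

  // Hypothetical helper: a wall-clock time as milliseconds since midnight.
  def at(hour: Int, minute: Int): Long = (hour * 60L + minute) * Minute

  // Removal time as described in the text: window end plus allowed lateness.
  def cleanupTime(windowEnd: Long, allowedLateness: Long): Long =
    windowEnd + allowedLateness

  // Assumption: the window may be removed once the watermark reaches that time.
  def isRemovable(windowEnd: Long, allowedLateness: Long, watermark: Long): Boolean =
    watermark >= cleanupTime(windowEnd, allowedLateness)
}
```

With a window ending at `at(12, 5)` and one minute of allowed lateness, the window is still kept at a watermark of 12:05 and becomes removable at 12:06.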

In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, AggregateFunction or FoldFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.

In other words, every window has a trigger and a function attached to it. The function holds the computation applied to the window's contents, and the trigger decides when the window is ready for that function to run.

Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.

Window API

The general structure of a windowed Flink program is presented below. The first snippet refers to keyed streams, while the second to non-keyed ones. As one can see, the only difference is the keyBy(...) call for the keyed streams and the window(...) which becomes windowAll(...) for non-keyed streams. This is also going to serve as a roadmap for the rest of the page.

In short: a windowed Flink program has the same general structure in both cases. The first snippet is for streams grouped by key and the second for streams that are not. The only difference is that window(...) is applied to the KeyedStream returned by keyBy(...), while windowAll(...) is applied to a non-keyed DataStream.

window() can only be called after keyBy(); a non-keyed stream has to use windowAll() instead.
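The two snippets that the quoted passage refers to are not reproduced in these notes. As a rough sketch of the difference, assuming the Flink Scala streaming API (the 1.x line that these notes are based on) is on the classpath, a keyed and a non-keyed pipeline might look like this. The sample data, the 5-second window size and the object name are made up for illustration.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowSkeleton {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical input: (page, visits) pairs.
    val stream: DataStream[(String, Int)] =
      env.fromElements(("home", 1), ("cart", 2), ("home", 3))

    // Keyed stream: keyBy(...) followed by window(...).
    stream
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      // optional calls such as .trigger(...), .evictor(...), .allowedLateness(...) go here
      .reduce((a, b) => (a._1, a._2 + b._2))
      .print()

    // Non-keyed stream: no keyBy(...), and windowAll(...) instead of window(...).
    stream
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce((a, b) => ("all", a._2 + b._2))
      .print()

    env.execute("window skeleton")
  }
}
```

Because the input here is a small bounded collection and the windows are based on processing time, the job may well finish before any window fires; the point of the sketch is only the shape of the two call chains.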

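Window Types

The window types are:

  • Time windows

    • Tumbling time windows
    • Sliding time windows
    • Session windows
  • Count windows

    • Tumbling count windows
    • Sliding count windows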

Tumbling Windows

A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes.

Example requirements: count the visits per day; count the visits per hour.
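Because tumbling windows have a fixed size and never overlap, the window an element belongs to follows directly from its timestamp. The sketch below shows that arithmetic in plain Scala. It assumes non-negative timestamps in milliseconds and windows aligned to timestamp 0; `TumblingSketch` and `windowFor` are hypothetical names, not Flink API.

```scala
object TumblingSketch {
  // Returns the [start, end) window that contains the timestamp.
  // Assumes timestamp >= 0 and windows aligned to 0.
  def windowFor(timestamp: Long, size: Long): (Long, Long) = {
    val start = timestamp - (timestamp % size)
    (start, start + size)
  }
}
```

With 5-minute windows, an element stamped at minute 7 falls into the window from minute 5 to minute 10, and an element stamped exactly at minute 5 opens that same window, since the end of a window is exclusive.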

Sliding Windows

The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.

For example, you could have windows of size 10 minutes that slide by 5 minutes. With this you get every 5 minutes a window that contains the events that arrived during the last 10 minutes.

Example requirement: count the visits over the last 24 hours.

Session Windows
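The statement that an element is assigned to multiple windows when the slide is smaller than the size can be made concrete with a small calculation. This is a sketch in plain Scala under the same assumptions as before (non-negative timestamps, windows aligned to 0); `SlidingSketch` and `windowsFor` are hypothetical names.

```scala
object SlidingSketch {
  // Returns every [start, end) window that contains the timestamp,
  // starting with the most recently opened one.
  def windowsFor(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Start of the latest window that has opened at or before the timestamp.
    val lastStart = timestamp - (timestamp % slide)
    Iterator
      .iterate(lastStart)(_ - slide)                // walk back one slide at a time
      .takeWhile(start => start > timestamp - size) // stop once the window no longer covers it
      .map(start => (start, start + size))
      .toList
  }
}
```

With a size of 10 and a slide of 5 (in any unit), an element at time 12 belongs to the windows [10, 20) and [5, 15), so each element ends up in size / slide = 2 windows.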

The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead a session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.
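To illustrate how a gap of inactivity splits a stream into sessions, the sketch below groups a list of timestamps with a static gap. It is plain Scala and not Flink's implementation. `SessionSketch` and `sessions` are hypothetical names, and treating an element that arrives exactly when the gap expires as the start of a new session is an assumption of this sketch.

```scala
object SessionSketch {
  // Returns the [start, end) sessions, where each session ends `gap` after its last element.
  def sessions(timestamps: Seq[Long], gap: Long): List[(Long, Long)] =
    timestamps.sorted
      .foldLeft(List.empty[(Long, Long)]) {
        // The element arrives before the current session expires: extend that session.
        case ((start, end) :: rest, ts) if ts < end =>
          (start, math.max(end, ts + gap)) :: rest
        // Otherwise the gap has passed (or there is no session yet): open a new one.
        case (acc, ts) =>
          (ts, ts + gap) :: acc
      }
      .reverse
}
```

For the timestamps 1, 2, 3, 10, 11, 30 and a gap of 5, this yields three sessions: [1, 8), [10, 16) and [30, 35).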

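Window Functions

  • Incremental aggregation functions (e.g. ReduceFunction, AggregateFunction)
  • Full window functions (e.g. ProcessWindowFunction)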
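The difference between the two kinds can be sketched without any Flink API: an incremental aggregation keeps only a small accumulator that is updated as each element arrives, while a full window function sees all buffered elements at once when the window fires. The names below (`WindowFunctionSketch`, `AvgAcc`, `incrementalAverage`, `fullWindowMedian`) are hypothetical, and the average and median are just example computations.

```scala
object WindowFunctionSketch {
  // Incremental style: the only state is this accumulator, updated per element.
  final case class AvgAcc(sum: Long, count: Long) {
    def add(value: Long): AvgAcc = AvgAcc(sum + value, count + 1)
    def result: Double = if (count == 0) 0.0 else sum.toDouble / count
  }

  def incrementalAverage(elements: Iterator[Long]): Double =
    elements.foldLeft(AvgAcc(0L, 0L))((acc, v) => acc.add(v)).result

  // Full-window style: every element is buffered until the window fires,
  // which is needed for results such as a median. Assumes a non-empty window.
  def fullWindowMedian(buffered: Seq[Long]): Long = {
    val sorted = buffered.sorted
    sorted(sorted.size / 2) // upper median for even-sized windows
  }
}
```

The trade-off is that the incremental style needs far less state per window, whereas the full-window style can compute anything that requires the whole window's contents.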

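Time Semantics

Time Types

  • Event time: the time at which the event actually occurred
  • Processing time: the time at which the event is processed
  • Ingestion time: the time at which the event enters the stream processing framework

To work with event time, streaming programs need to set the time characteristic accordingly.

In short: a streaming program that uses event time has to set the time characteristic to match.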

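import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)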

Assigning Timestamps

In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.
Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.
There are two ways to assign timestamps and generate watermarks:

  1. Directly in the data stream source
  2. Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted

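In short: to use event time, Flink needs to know each event's timestamp, so every element in the stream must have an event timestamp assigned. This is usually extracted from a field of the element.

Watermarks

Networks, distributed processing and similar factors cause data to arrive out of order.

Out-of-order data makes window computations inaccurate.

Watermarks are introduced to deal with this: they account for the fact that events may arrive late.

For example, suppose the maximum delay is set to 3s and events A, B, C, D and E arrive at Flink in that order.

Their event times are A (second 11), B (second 13), C (second 12), D (second 14) and E (second 15).

A arrives: watermark = 11s - 3s = 8s, meaning events with an event time of 8s or less can now be processed.
B arrives: watermark = 13s - 3s = 10s.
C arrives: 12s - 3s = 9s is lower than the current watermark, and a watermark never moves backwards, so it stays at 10s.
D arrives: watermark = 14s - 3s = 11s, so A is processed.
E arrives: watermark = 15s - 3s = 12s, so C is processed.

And so on. The final processing order is A, C, B, D, E, which is event-time order, so the out-of-order problem is resolved.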
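The walkthrough above can be replayed with a small simulation. This is a plain Scala sketch of the idea only: the watermark is taken as the highest timestamp seen so far minus the maximum delay (so it never decreases), and buffering plus sorting stands in for "an event is processed once the watermark reaches its timestamp". It does not reflect how Flink operators are implemented, and `WatermarkSketch`, `Event` and `releaseOrder` are hypothetical names.

```scala
object WatermarkSketch {
  final case class Event(name: String, timestamp: Long)

  // Returns event names in the order they are released for processing.
  def releaseOrder(events: Seq[Event], maxDelay: Long): List[String] = {
    var maxTimestamp = Long.MinValue
    var buffered = List.empty[Event]
    val released = List.newBuilder[String]

    for (event <- events) {
      buffered = event :: buffered
      // The watermark follows the highest timestamp seen, so it cannot go backwards.
      maxTimestamp = math.max(maxTimestamp, event.timestamp)
      val watermark = maxTimestamp - maxDelay

      // Everything at or below the watermark is ready, in event-time order.
      val (ready, waiting) = buffered.partition(_.timestamp <= watermark)
      released ++= ready.sortBy(_.timestamp).map(_.name)
      buffered = waiting
    }

    // End of input: flush whatever is still buffered, in event-time order.
    released ++= buffered.sortBy(_.timestamp).map(_.name)
    released.result()
  }
}
```

Feeding in A(11), B(13), C(12), D(14), E(15) with a maximum delay of 3 releases A when D arrives and C when E arrives, and the remaining events are flushed at the end, giving the order A, C, B, D, E.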

Extracting the event time from each event and generating watermarks from it:


import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// MyEvent is assumed to be a user-defined event type that exposes getCreationTime().
class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

    val maxOutOfOrderness = 3500L // 3.5 seconds

    // highest event timestamp seen so far
    var currentMaxTimestamp: Long = _

    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        val timestamp = element.getCreationTime()
        currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
        timestamp
    }

    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }
}

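Extracting the timestamp from each event and generating watermarks that lag behind the processing time (the system clock) by a fixed amount: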

class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

    val maxTimeLag = 5000L // 5 seconds

    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        element.getCreationTime
    }

    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current time minus the maximum time lag
        new Watermark(System.currentTimeMillis() - maxTimeLag)
    }
}

Window Operations

  • Processing Time
  • Event Time

References

Flink official documentation: Event Time
Flink official documentation: Windows
Flink official documentation: Generating Timestamps / Watermarks

posted @ 2020-03-02 01:24  清泉白石