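Flink Windows
I. Basic Concepts
1. Window types
TimeWindow: windows are created based on time. Depending on how the window is implemented, a TimeWindow falls into one of three types: tumbling window (Tumbling Window), sliding window (Sliding Window), and session window (Session Window).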
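Tumbling and sliding windows have fixed boundaries, while a session window is delimited by a period of inactivity. The following is a minimal plain-Java sketch of that idea, not Flink code: `SessionSketch`, `sessions`, and the sample timestamps are hypothetical names and values, and the choice to keep two records in the same session when they are exactly `gap` apart is an assumption about boundary handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: groups sorted timestamps into sessions.
final class SessionSketch {

    // Returns one long[]{firstTimestamp, lastTimestamp} per session.
    // Assumption: a new session starts only when the distance to the
    // previous record is strictly greater than the gap.
    static List<long[]> sessions(long[] sortedTimestamps, long gap) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            if (!result.isEmpty() && ts - result.get(result.size() - 1)[1] <= gap) {
                // Still within the gap: extend the current session.
                result.get(result.size() - 1)[1] = ts;
            } else {
                // First record, or the gap was exceeded: open a new session.
                result.add(new long[] {ts, ts});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {1L, 3L, 4L, 20L, 22L};
        for (long[] session : sessions(timestamps, 5L)) {
            System.out.println(Arrays.toString(session));
        }
        // prints [1, 4] and then [20, 22]
    }
}
```

With a gap of 5, the jump from 4 to 20 closes the first session, so the five records form two sessions.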
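2. Notions of time
Event Time: the time at which an event was created. It is usually carried as a timestamp inside the event; for example, each collected log record stores the time it was generated. Flink reads the event timestamp through a timestamp assigner.
Ingestion Time: the time at which a record enters Flink.
Processing Time: the local system time of the machine running the operator when the record is processed. Case 1 below uses it.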
II. Case Demonstrations
Case 1: Tumbling time window on Processing Time
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Assumed definition (not shown in the original): id, timestamp in seconds, temperature
case class SensorReading(id: String, timestamp: Long, temperature: Double)

object WindowTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val socketStream = env.socketTextStream("hadoop102", 7777)
    // Parse each "id, timestamp, temperature" line into a SensorReading
    val dataStream: DataStream[SensorReading] = socketStream.map(d => {
      val arr = d.split(",")
      SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble)
    })
    // Minimum temperature per sensor within each 10-second window
    val minTemperatureStream = dataStream.map(data => (data.id, data.temperature))
      .keyBy(_._1)
      .timeWindow(Time.seconds(10)) // 10-second tumbling window; with no time characteristic set, the default is ProcessingTime
      .reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))
    // Print the original dataStream
    dataStream.print("data stream")
    // Print the windowed stream
    minTemperatureStream.print("min temperature")
    env.execute("window test")
  }
}
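Test:
Enter two records in quick succession:
[atguigu@hadoop102 ~]$ nc -lk 7777
sensor_1, 1547718200, 30.8
sensor_1, 1547718201, 40.8
When both records arrive within the same 10-second tumbling window, the window stream minTemperatureStream emits only one record. The window does not fire exactly 10 seconds after the first record arrives: processing-time windows are aligned to multiples of the window size, so the window fires when the system clock reaches the end of the current aligned 10-second window, which is at most 10 seconds after the first record.
data stream> SensorReading(sensor_1,1547718200,30.8)
data stream> SensorReading(sensor_1,1547718201,40.8)
min temperature> (sensor_1,30.8)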
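To make the firing time of a processing-time tumbling window concrete, here is a small plain-Java sketch. It assumes an offset of 0 and a non-negative clock value; `ProcessingTimeWindowSketch`, `windowEnd`, and the wall-clock value are hypothetical and chosen only for illustration.

```java
// Hypothetical helper, not part of Flink: computes when an aligned window ends.
final class ProcessingTimeWindowSketch {

    // End of the aligned window containing nowMillis (offset 0, nowMillis >= 0).
    static long windowEnd(long nowMillis, long sizeMillis) {
        long start = nowMillis - (nowMillis % sizeMillis);
        return start + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 10_000L;
        // Assumed wall-clock time of the first record: 7 s into an aligned window.
        long firstArrival = 1_600_000_007_000L;
        long end = windowEnd(firstArrival, size);
        System.out.println("window ends at " + end + ", i.e. "
                + (end - firstArrival) + " ms after the first record");
        // prints: window ends at 1600000010000, i.e. 3000 ms after the first record
    }
}
```

In this example the result appears 3 seconds after the first record, not 10, because the record arrived 7 seconds into the aligned window.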
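Case 2: Tumbling time window on Event Time with a watermark
Key points:
① env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) sets the time characteristic to event time.
② Pass assignTimestampsAndWatermarks() an instance of a BoundedOutOfOrdernessTimestampExtractor subclass. The constructor argument is the tolerated delay (maximum out-of-orderness), and the overridden extractTimestamp method specifies which field supplies the timestamp, in milliseconds.
③ env.getConfig.setAutoWatermarkInterval(300) sets how often watermarks are generated: the system periodically inserts watermarks into the stream (a watermark is itself a special kind of record). The default interval is 200 ms.
④ If the watermark is assigned before keyBy, it is not tracked per key: all keys processed by the same parallel subtask share one watermark (with parallelism 1, as here, there is a single watermark for the whole stream). Results are still emitted separately per key.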
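The bounded out-of-orderness rule can be sketched in plain Java as "watermark = largest timestamp seen so far minus the tolerated delay". This is an illustration under that assumption rather than Flink's implementation; `BoundedWatermarkSketch`, `onEvent`, and `currentWatermark` are hypothetical names.

```java
// Hypothetical helper, not part of Flink: tracks a bounded out-of-orderness watermark.
final class BoundedWatermarkSketch {
    private final long maxOutOfOrdernessMillis;
    private long maxTimestampMillis;

    BoundedWatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
        // Start high enough that the subtraction in currentWatermark() cannot underflow.
        this.maxTimestampMillis = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    // Called for every record: remember the largest timestamp seen so far.
    void onEvent(long timestampMillis) {
        maxTimestampMillis = Math.max(maxTimestampMillis, timestampMillis);
    }

    // Called periodically: the watermark lags the largest timestamp by the delay.
    long currentWatermark() {
        return maxTimestampMillis - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        BoundedWatermarkSketch wm = new BoundedWatermarkSketch(2_000L);
        // The third record is out of order and does not move the watermark back.
        long[] eventsMillis = {1_547_718_199_000L, 1_547_718_201_000L, 1_547_718_200_000L};
        for (long ts : eventsMillis) {
            wm.onEvent(ts);
            System.out.println("event " + ts + " -> watermark " + wm.currentWatermark());
        }
    }
}
```

The watermark only moves forward: an out-of-order record with a smaller timestamp leaves it unchanged.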
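import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val socketStream = env.socketTextStream("hadoop102", 7777)
    val dataStream: DataStream[SensorReading] = socketStream
      .map(d => {
        val arr = d.split(",")
        SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble)
      })
      // Tolerate up to 2 seconds of out-of-orderness
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(2)) {
        // The input timestamp is in seconds; Flink expects milliseconds
        override def extractTimestamp(t: SensorReading): Long = t.timestamp * 1000
      })
    //.assignAscendingTimestamps(_.timestamp * 1000) // alternative for data whose timestamps are strictly ascending
    // Minimum temperature per sensor within each 5-second window
    val minTemperatureStream = dataStream.map(data => (data.id, data.temperature))
      .keyBy(_._1)
      .timeWindow(Time.seconds(5)) // 5-second tumbling window
      .reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))
    // Print the original dataStream
    dataStream.print("data stream")
    // Print the windowed stream
    minTemperatureStream.print("min temperature")
    env.execute("window test")
  }
}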
Test:
Enter the following data through the socket:
[atguigu@hadoop102 ~]$ nc -lk 7777
sensor_1, 1547718199, 29
sensor_1, 1547718200, 30
sensor_1, 1547718201, 31
sensor_1, 1547718202, 32
sensor_1, 1547718203, 33
The console prints:
data stream> SensorReading(sensor_1,1547718199,29.0)
data stream> SensorReading(sensor_1,1547718200,30.0)
data stream> SensorReading(sensor_1,1547718201,31.0)
data stream> SensorReading(sensor_1,1547718202,32.0)
min temperature> (sensor_1,29.0)
data stream> SensorReading(sensor_1,1547718203,33.0)
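How is the start time of the first tumbling window determined?
// Tumbling window assigner: source code that assigns an element to its window
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    if (timestamp > Long.MIN_VALUE) { // appears as the literal -9223372036854775808L in decompiled code
        long start = TimeWindow.getWindowStartWithOffset(timestamp, this.offset, this.size);
        return Collections.singletonList(new TimeWindow(start, start + this.size));
    } else {
        throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?");
    }
}
// Computes the window start. The offset (used, for example, for time-zone adjustment) defaults to 0.
// For a tumbling window, windowSize is the window size; the sliding window assigner passes its slide here instead.
// Adding windowSize before the modulo keeps the remainder non-negative when timestamp - offset is slightly negative,
// because Java's % takes the sign of the dividend.
public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}
For our test data, written in seconds for readability (the job actually works in milliseconds, because extractTimestamp multiplies by 1000), the window start is start = 1547718199 - (1547718199 - 0 + 5) % 5 = 1547718195, so the first window is [1547718195, 1547718200). With no out-of-orderness allowance, this window would fire on receiving a record with timestamp 1547718200. Because a 2-second delay is configured, the watermark lags the largest timestamp by 2 seconds, so the result is printed only after the record with timestamp 1547718202 arrives.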
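The arithmetic above can be checked with a small plain-Java sketch that uses the millisecond values the job actually sees. `TumblingWindowSketch` and `fires` are hypothetical names; the firing rule used here, "watermark >= window end - 1", is stated as an assumption about the event-time trigger, and the watermark is taken to be the largest timestamp minus the 2-second delay.

```java
// Hypothetical helper, not part of Flink: reproduces the window-start arithmetic.
final class TumblingWindowSketch {

    // Same formula as the getWindowStartWithOffset source quoted in the text.
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Assumed firing rule: the window [start, end) fires once the watermark
    // has reached its last included millisecond, end - 1.
    static boolean fires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long size = 5_000L;
        long delay = 2_000L;
        long start = windowStart(1_547_718_199_000L, 0L, size);
        long end = start + size;
        System.out.println("first window: [" + start + ", " + end + ")");

        // Feed the test records in order and report when the first window fires.
        long maxTs = Long.MIN_VALUE + delay;
        for (long seconds = 1_547_718_199L; seconds <= 1_547_718_203L; seconds++) {
            maxTs = Math.max(maxTs, seconds * 1000);
            long watermark = maxTs - delay;
            System.out.println(seconds + " -> watermark " + watermark
                    + ", fires: " + fires(end, watermark));
        }
    }
}
```

The first line with `fires: true` is the record 1547718202, which matches the console output of the test.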
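Timestamps and watermarks can also be produced by a custom assigner. Option 1, periodic: implement AssignerWithPeriodicWatermarks; getCurrentWatermark is called at the auto-watermark interval, 200 ms by default. Option 2, punctuated: implement AssignerWithPunctuatedWatermarks and emit a watermark whenever a record meets a custom condition. A periodic example:
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class MyAssigner extends AssignerWithPeriodicWatermarks[SensorReading] {
  // Tolerated delay: 1 minute, in milliseconds
  val bound: Long = 60 * 1000L
  // Largest timestamp seen so far; starting at Long.MinValue + bound keeps maxTs - bound from underflowing
  var maxTs: Long = Long.MinValue + bound
  // The watermark is the largest timestamp minus the delay
  override def getCurrentWatermark: Watermark = new Watermark(maxTs - bound)
  override def extractTimestamp(t: SensorReading, l: Long): Long = {
    // Update the largest timestamp
    maxTs = maxTs.max(t.timestamp * 1000)
    // Return the timestamp in milliseconds
    t.timestamp * 1000
  }
}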
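Case 3: Sliding time window
Sliding windows behave much like tumbling windows; a tumbling window can be seen as a special sliding window whose size equals its slide. The call .timeWindow(Time.seconds(15), Time.seconds(5)) sets a window size of 15 seconds and a slide of 5 seconds. The window size determines how much data each window computation covers, while the slide determines how often a new window starts and therefore how often a result is emitted.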
// Place inside object WindowTest; the imports are the same as in Case 2
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val socketStream = env.socketTextStream("hadoop102", 7777)
  val dataStream: DataStream[SensorReading] = socketStream
    .map(d => {
      val arr = d.split(",")
      SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble)
    })
    // For simplicity, tolerate no out-of-orderness (delay of 0 seconds)
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(0)) {
      override def extractTimestamp(t: SensorReading): Long = t.timestamp * 1000
    })
  // Minimum temperature per sensor over the last 15 seconds, emitted every 5 seconds
  val minTemperatureStream = dataStream.map(data => (data.id, data.temperature))
    .keyBy(_._1)
    .timeWindow(Time.seconds(15), Time.seconds(5)) // sliding window: size 15 s, slide 5 s
    .reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))
  // Print the original dataStream
  dataStream.print("data stream")
  // Print the windowed stream
  minTemperatureStream.print("min temperature")
  env.execute("window test")
}
Socket input:
[atguigu@hadoop102 ~]$ nc -lk 7777
sensor_1, 1547718199, 29
sensor_1, 1547718200, 30
sensor_1, 1547718201, 31
Console output:
data stream> SensorReading(sensor_1,1547718199,29.0)
data stream> SensorReading(sensor_1,1547718200,30.0)
min temperature> (sensor_1,29.0)
data stream> SensorReading(sensor_1,1547718201,31.0)
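How is the first sliding window determined? Unlike a tumbling window, the slide differs from the window size, so each record is assigned to several windows at once.
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    if (timestamp <= Long.MIN_VALUE) { // appears as the literal -9223372036854775808L in decompiled code
        throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?");
    } else {
        List<TimeWindow> windows = new ArrayList<>((int) (this.size / this.slide));
        // Start of the latest window that contains the timestamp, aligned to the slide
        long lastStart = TimeWindow.getWindowStartWithOffset(timestamp, this.offset, this.slide);
        // Walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestamp - this.size; start -= this.slide) {
            windows.add(new TimeWindow(start, start + this.size));
        }
        return windows;
    }
}
For this case (again written in seconds for readability), lastStart = 1547718199 - (1547718199 - 0 + 5) % 5 = 1547718195, and the loop keeps adding windows as long as start > 1547718199 - 15 = 1547718184. The first iteration adds the window [1547718195, 1547718210), the second adds [1547718190, 1547718205), and the third adds [1547718185, 1547718200). Each record therefore belongs to size / slide = 15 / 5 = 3 windows. When the second record arrives with timestamp 1547718200, the watermark reaches the end of the third window, [1547718185, 1547718200), so that window fires and prints its result.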
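The loop can be replayed outside Flink to confirm the three windows. The sketch below is plain Java that mirrors the quoted source, using `long[]{start, end}` pairs in place of `TimeWindow`; `SlidingWindowSketch` is a hypothetical name, and the values are the millisecond timestamps the job actually sees.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: replays the sliding window assignment loop.
final class SlidingWindowSketch {

    // Same formula as the getWindowStartWithOffset source quoted in the text.
    static long windowStart(long timestamp, long offset, long slide) {
        return timestamp - (timestamp - offset + slide) % slide;
    }

    // Returns every [start, end) window that contains the timestamp, latest first.
    static List<long[]> assignWindows(long timestamp, long offset, long size, long slide) {
        List<long[]> windows = new ArrayList<>((int) (size / slide));
        long lastStart = windowStart(timestamp, offset, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // First test record, size 15 s, slide 5 s, all in milliseconds.
        for (long[] w : assignWindows(1_547_718_199_000L, 0L, 15_000L, 5_000L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
        // prints:
        // [1547718195000, 1547718210000)
        // [1547718190000, 1547718205000)
        // [1547718185000, 1547718200000)
    }
}
```

The window with the earliest end, [1547718185000, 1547718200000), is the one that fires when the record with timestamp 1547718200 arrives.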
