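Flink Process Functions

1. Introduction

A process function (ProcessFunction) defines a transformation on a data stream, so it can also be regarded as a transformation operator. Inside it we can access the events in the stream, their timestamps, and the current watermark, and through the "timer service" it exposes we can even register timers. ProcessFunction extends AbstractRichFunction, so it has all the features of a rich function and can access state and other runtime information. A process function can also emit data directly to a side output. In short, it can implement almost any custom business logic.

A process function is used like any other transformation: call .process() on a DataStream and pass in a ProcessFunction.

Basic usage: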

package cn.qz.process;


import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class Process1 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L),

                new MyEvent("tq", "/product?id=3", 1800L)
        );

        // Out-of-order stream; the maximum out-of-orderness is set to zero
        dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // No keyBy or window is needed: a ProcessFunction can be applied to a plain DataStream
                .process(new ProcessFunction<MyEvent, String>() {

                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        if ("zs".equals(value.user)) {
                            out.collect(value.getUser());
                        }
                    }
                }).print();

        executionEnvironment.execute();
    }
}

Result:

5> zs
3> zs
4> zs
6> zs
7> zs

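Brief explanation:

1. ProcessFunction extends AbstractRichFunction and takes two type parameters: I, the input type, and O, the output type.
2. It defines two methods:
    The abstract method processElement is called once for every element in the stream. It has no return value; results are emitted through the Collector.
    The non-abstract method onTimer holds the logic that runs when a timer fires. It is called only when a previously registered timer fires, and timers are registered through the "timer service" TimerService. Only a keyed stream (KeyedStream) supports timers.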

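2. Categories

There are roughly eight different process functions:

1.ProcessFunction: the most basic process function, passed as the argument when .process() is called directly on a DataStream.
2.KeyedProcessFunction: works on a keyed stream (KeyedStream) and can use timers.
3.ProcessWindowFunction: the process function used after windowing, and the typical full-window function. Works on a WindowedStream.
4.ProcessAllWindowFunction: passed to .process() called on an AllWindowedStream.
5.CoProcessFunction: the process function for two connected streams. Works on ConnectedStreams.
6.ProcessJoinFunction: the process function for an interval join of two streams. Works on IntervalJoined.
7.BroadcastProcessFunction: the process function for a broadcast connected stream. Works on a BroadcastConnectedStream.
8.KeyedBroadcastProcessFunction: the keyed variant of the broadcast process function, also based on BroadcastConnectedStream. The difference from the previous one is that here the connected stream is produced by connecting a KeyedStream with a BroadcastStream.

The rest of this article focuses on KeyedProcessFunction and ProcessWindowFunction.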

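2. The Keyed Process Function: KeyedProcessFunction

Keying a stream is what makes aggregation and windowed computation possible: after keyBy, records are grouped by key and the groups are distributed across parallel subtasks. In addition, only a KeyedStream supports setting timers through the TimerService, so we usually call keyBy first and compute afterwards.

1. Timers and the TimerService

TimerService is Flink's basic service interface for time and timers. It contains the following six methods: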

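// Returns the current processing time
long currentProcessingTime();
// Returns the current watermark (event time)
long currentWatermark();
// Registers a processing-time timer that fires when processing time reaches time
void registerProcessingTimeTimer(long time);
// Registers an event-time timer that fires when the watermark reaches or passes time
void registerEventTimeTimer(long time);
// Deletes the processing-time timer whose trigger time is time
void deleteProcessingTimeTimer(long time);
// Deletes the event-time timer whose trigger time is time
void deleteEventTimeTimer(long time);

These fall into two groups (processing time and event time) with three operations each: getting the current time, registering a timer, and deleting a timer.

For each key and timestamp there is at most one timer. If the same timer is registered several times, onTimer is still called only once. In addition, onTimer and processElement are invoked synchronously, so state is never modified concurrently.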
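
The deduplication rule above can be pictured with a small model in plain Java. This is only a sketch of the observable behaviour, not Flink's implementation; the class TimerDedupSketch and its methods register and advanceTo are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of per-key timer deduplication; not Flink's actual implementation. */
final class TimerDedupSketch {

    // For each key, the set of registered timestamps. A set stores each timestamp once.
    private final Map<String, TreeSet<Long>> timersByKey = new HashMap<>();

    /** Registers a timer; registering the same (key, time) pair again has no effect. */
    void register(String key, long time) {
        timersByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(time);
    }

    /** Fires and removes every timer whose time is <= now, returning "key@time" labels. */
    List<String> advanceTo(long now) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Long>> entry : timersByKey.entrySet()) {
            // Copy the due timestamps first so they can be removed safely afterwards.
            TreeSet<Long> due = new TreeSet<>(entry.getValue().headSet(now, true));
            for (long time : due) {
                fired.add(entry.getKey() + "@" + time);
            }
            entry.getValue().removeAll(due);
        }
        return fired;
    }

    public static void main(String[] args) {
        TimerDedupSketch timers = new TimerDedupSketch();
        timers.register("Mary", 10_000L);
        timers.register("Mary", 10_000L); // same key and time: no second timer
        timers.register("Bob", 10_000L);  // different key: a separate timer
        System.out.println(timers.advanceTo(10_000L));
    }
}
```

In this model, registering the same key and timestamp twice leaves a single entry, so it fires once, while the same timestamp under a different key counts as a separate timer.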

1. Processing-time timers

package cn.qz.process;

import cn.qz.time.MyEvent;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.util.Calendar;
import java.util.Random;

public class ProcessingTimeTimer {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        // Processing-time semantics: no timestamps or watermarks need to be assigned
        SingleOutputStreamOperator<MyEvent> dataStreamSource = executionEnvironment.addSource(new ClickSource());

        // Timers require a KeyedStream
        dataStreamSource.keyBy(data -> true)
                .process(new KeyedProcessFunction<Boolean, MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        Long currTs = ctx.timerService().currentProcessingTime();
                        out.collect("Data arrived, arrival time: " + new Timestamp(currTs));
                        // Register a timer that fires 10 seconds from now
                        ctx.timerService().registerProcessingTimeTimer(currTs + 10 * 1000L);
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        out.collect("Timer fired S, trigger time: " + new Timestamp(timestamp));
                        // Sleep for 1 second; processElement cannot run while onTimer is executing
                        Thread.sleep(1 * 1000);
                        out.collect("Timer fired S, trigger time: " + new Timestamp(timestamp));
                    }
                })
                .print();

        executionEnvironment.execute();
    }
}

class ClickSource implements SourceFunction<MyEvent> {
    // Flag that controls data generation; volatile because cancel() is called from another thread
    private volatile boolean running = true;
    @Override
    public void run(SourceContext<MyEvent> ctx) throws Exception {
        Random random = new Random();    // Picks random values from the arrays below
        String[] users = {"Mary", "Alice", "Bob", "Cary"};
        String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};

        while (running) {
            ctx.collect(new MyEvent(
                    users[random.nextInt(users.length)],
                    urls[random.nextInt(urls.length)],
                    Calendar.getInstance().getTimeInMillis()
            ));
            // Emit one click event every 5 seconds so the output is easy to follow
            Thread.sleep(5000);
        }
    }
    @Override
    public void cancel() {
        running = false;
    }

}

Result:

Data arrived, arrival time: 2022-08-30 14:13:09.948
Data arrived, arrival time: 2022-08-30 14:13:14.97
Timer fired S, trigger time: 2022-08-30 14:13:19.948
Timer fired S, trigger time: 2022-08-30 14:13:19.948
Data arrived, arrival time: 2022-08-30 14:13:20.951
Timer fired S, trigger time: 2022-08-30 14:13:24.97
Timer fired S, trigger time: 2022-08-30 14:13:24.97
Data arrived, arrival time: 2022-08-30 14:13:25.973

2. Event-time timers

package cn.qz.process;
import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class EventTimeTimer {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<MyEvent> stream = env.addSource(new CustomSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        }));

        // Define an event-time timer on a KeyedStream
        stream.keyBy(data -> true)
                .process(new KeyedProcessFunction<Boolean, MyEvent, String>() {
                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        out.collect("Data arrived, timestamp: " + ctx.timestamp());
                        out.collect("Data arrived, watermark: " + ctx.timerService().currentWatermark() + "\n -------separator-------");
                        // Register a timer 10 seconds after the event timestamp
                        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 10 * 1000L);
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        out.collect("Timer fired, trigger time: " + timestamp);
                    }
                })
                .print();

        env.execute();
    }

    // Custom test source
    public static class CustomSource implements SourceFunction<MyEvent> {
        @Override
        public void run(SourceContext<MyEvent> ctx) throws Exception {
            // Emit the test data directly
            ctx.collect(new MyEvent("Mary", "./home", 1000L));
            // Pause for 5 seconds to make the effect easier to see
            Thread.sleep(5000L);

            // Emit an event that is 10 seconds later in event time
            ctx.collect(new MyEvent("Mary", "./home", 11000L));
            Thread.sleep(5000L);

            // Emit an event that is 10 seconds + 1 ms later in event time
            ctx.collect(new MyEvent("Alice", "./cart", 11001L));
            Thread.sleep(5000L);
        }

        @Override
        public void cancel() { }
    }
}

Result:

Data arrived, timestamp: 1000
Data arrived, watermark: -9223372036854775808
 -------separator-------
Data arrived, timestamp: 11000
Data arrived, watermark: 999
 -------separator-------
Data arrived, timestamp: 11001
Data arrived, watermark: 10999
 -------separator-------
Timer fired, trigger time: 11000
Timer fired, trigger time: 21000
Timer fired, trigger time: 21001

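As the output shows, under event-time semantics a timer fires once the watermark advances to the registered time. The first timer (11000) fires only after the third event, because the watermark trails the largest timestamp seen so far by 1 ms. The remaining two timers fire when the source finishes and Flink emits a final watermark of Long.MAX_VALUE.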
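
The order of this output can be reproduced with a small plain-Java replay. It is a simplified model rather than Flink code: it assumes the watermark equals the largest timestamp seen so far minus 1 ms, that the watermark advances right after each element (Flink actually emits watermarks periodically, which gives the same picture here because the events are 5 seconds apart), and that the source ends with a final watermark of Long.MAX_VALUE. The class EventTimeTimerSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy replay of the event-time example; a simplified model, not Flink code. */
final class EventTimeTimerSketch {

    static List<String> replay(long[] eventTimestamps, long timerDelay) {
        List<String> log = new ArrayList<>();
        TreeSet<Long> timers = new TreeSet<>();   // single key, so one sorted set is enough
        long watermark = Long.MIN_VALUE;          // initial watermark before any event
        for (long ts : eventTimestamps) {
            // The element is processed before the watermark it causes arrives.
            log.add("element ts=" + ts + " watermark=" + watermark);
            timers.add(ts + timerDelay);
            // Assumed monotonous-timestamp rule: watermark = max timestamp seen - 1.
            watermark = Math.max(watermark, ts - 1);
            fireDueTimers(timers, watermark, log);
        }
        // A finished source is modelled as a final watermark of Long.MAX_VALUE.
        fireDueTimers(timers, Long.MAX_VALUE, log);
        return log;
    }

    /** Fires, in timestamp order, every timer whose time is <= the watermark. */
    private static void fireDueTimers(TreeSet<Long> timers, long watermark, List<String> log) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            log.add("timer fired at " + timers.pollFirst());
        }
    }

    public static void main(String[] args) {
        replay(new long[] {1000L, 11000L, 11001L}, 10_000L).forEach(System.out::println);
    }
}
```

Replaying the three test timestamps with a 10-second delay yields the same sequence of watermarks and timer firings as the job output shown earlier.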

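3. Window Process Functions

Two other commonly used process functions are the window-based ProcessWindowFunction and ProcessAllWindowFunction.

For window computations we can use the built-in aggregations (sum/max/min) directly, or call reduce or aggregate with a custom incremental aggregation function (ReduceFunction/AggregateFunction). For more complex cases that need window information or extra state, we can use a full-window function, which keeps all the data in the window and processes it in one go when the window fires.

1. ProcessWindowFunction

ProcessWindowFunction is both a process function and a full-window function.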

public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window>
        extends AbstractRichFunction {
    public abstract void process(
            KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;
    public void clear(Context context) throws Exception {}
        /** The context holding window metadata. */
    public abstract class Context implements java.io.Serializable {
        /** Returns the window that is being evaluated. */
        public abstract W window();

        /** Returns the current processing time. */
        public abstract long currentProcessingTime();

        /** Returns the current event-time watermark. */
        public abstract long currentWatermark();

        /**
         * State accessor for per-key and per-window state.
         *
         * <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up by
         * implementing {@link ProcessWindowFunction#clear(Context)}.
         */
        public abstract KeyedStateStore windowState();

        /** State accessor for per-key global state. */
        public abstract KeyedStateStore globalState();

        /**
         * Emits a record to the side output identified by the {@link OutputTag}.
         *
         * @param outputTag the {@code OutputTag} that identifies the side output to emit to.
         * @param value The record to emit.
         */
        public abstract <X> void output(OutputTag<X> outputTag, X value);
    }
}    

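Unlike the processElement method seen earlier, the process method does not handle elements one at a time; it receives all elements of the window as a batch. There is also an additional clear method, which is used for cleanup when the window is purged.

There is no TimerService here; the current time is available only through the context's currentProcessingTime() and currentWatermark() methods. The context also provides methods for accessing per-window state and global state.

A window already carries its own point in time at which computation is triggered. If additional timing is needed, use a Trigger: the TriggerContext of a trigger plays a role similar to TimerService, allowing you to get the time and to register and delete timers.

2. ProcessAllWindowFunction

It is very similar to ProcessWindowFunction above, except that it works on an AllWindowedStream, that is, a stream that is windowed directly without keyBy before .process() is called. Its API is as follows: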

public abstract class ProcessAllWindowFunction<IN, OUT, W extends Window>
        extends AbstractRichFunction {
    
        public abstract void process(Context context, Iterable<IN> elements, Collector<OUT> out)
            throws Exception;
    
        public void clear(Context context) throws Exception {}
    
    ...
}        

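3. Use case: Top N

Suppose we want real-time statistics on the most popular URLs over a period of time: the two most visited URLs in the last ten seconds, updated every five seconds.

Analysis: this calls for a sliding window with a size of 10 seconds and a slide of 5 seconds.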
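
To see which windows a single event falls into, the assignment arithmetic can be sketched in plain Java. The sketch assumes windows aligned to the epoch with no offset; SlidingWindowSketch and windowsFor are hypothetical names and not part of the Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes the sliding windows that contain a timestamp; an illustration only. */
final class SlidingWindowSketch {

    /** Returns {start, end} pairs, end exclusive, of every window containing the timestamp. */
    static List<long[]> windowsFor(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        // Start of the most recent window that begins at or before the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        for (long[] w : windowsFor(12_000L, 10_000L, 5_000L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

With a size of 10 seconds and a slide of 5 seconds, every event belongs to two windows. An event at 12000 ms, for example, falls into [10000, 20000) and [5000, 15000), which is why each click is counted in two consecutive results.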

0. Source code

package cn.qz.process;

import cn.qz.time.MyEvent;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Calendar;
import java.util.Random;

class ClickSource implements SourceFunction<MyEvent> {

    // Flag that controls data generation; volatile because cancel() is called from another thread
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<MyEvent> ctx) throws Exception {
        Random random = new Random();    // Picks random values from the arrays below
        String[] users = {"Mary", "Alice", "Bob", "Cary"};
        String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};

        while (running) {
            ctx.collect(new MyEvent(
                    users[random.nextInt(users.length)],
                    urls[random.nextInt(urls.length)],
                    Calendar.getInstance().getTimeInMillis()
            ));
            // Emit one click event per second so the output is easy to follow
            Thread.sleep(1000);
        }
    }
    @Override
    public void cancel() {
        running = false;
    }

}

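1. Implementation based on ProcessAllWindowFunction

This approach is straightforward. All the data of a window is computed in a single subtask. First a HashMap keeps each url and its visit count; the entries are then converted to Tuple2, stored in an ArrayList, sorted, and output.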
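
The counting and sorting step can be tried out independently of Flink. The following plain-Java sketch assumes the URLs of one window have already been collected into a list; TopNSketch and topN are hypothetical names. It orders counts with a comparator instead of subtracting them, and it tolerates windows that hold fewer than n distinct URLs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts URLs and picks the n most frequent ones; an illustration only. */
final class TopNSketch {

    static List<Map.Entry<String, Long>> topN(Iterable<String> urls, int n) {
        // Count the occurrences of each URL.
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Descending by count; ties are broken by URL so the order is deterministic.
        sorted.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        // Guard against windows that hold fewer than n distinct URLs.
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList(
                "./home", "./cart", "./home", "./fav", "./cart", "./home");
        System.out.println(topN(window, 2));
    }
}
```

The tie-breaking rule is a choice made for this sketch; the Flink job in this section leaves the order of URLs with equal counts unspecified.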

Code:

package cn.qz.process;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;

public class ProcessAllWindowTopN {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        DataStreamSource<MyEvent> myEventDataStreamSource = executionEnvironment.addSource(new ClickSource());
        // Assign timestamps and watermarks
        SingleOutputStreamOperator<MyEvent> eventStream = myEventDataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps().withTimestampAssigner(
                new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }
        ));
        eventStream.map(new MapFunction<MyEvent, String>() {
            @Override
            public String map(MyEvent value) throws Exception {
                return value.getUrl();
            }
        })
        // Open a sliding window
        .windowAll(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
        .process(new ProcessAllWindowFunction<String, String, TimeWindow>() {
            @Override
            public void process(Context context, Iterable<String> elements, Collector<String> out) throws Exception {
                // Keep the counts in a HashMap: <url, count>
                HashMap<String, Long> urlCountMap = new HashMap<>();
                for (String url : elements) {
                    if (urlCountMap.containsKey(url)) {
                        urlCountMap.put(url, urlCountMap.get(url) + 1);
                    } else {
                        urlCountMap.put(url, 1L);
                    }
                }
                // Sort the HashMap entries by count: convert them to Tuple2, sort, then output
                List<Tuple2<String, Long>> tuple2s = new ArrayList<>();
                urlCountMap.forEach((k, v) -> {
                    tuple2s.add(Tuple2.of(k, v));
                });
                tuple2s.sort(new Comparator<Tuple2<String, Long>>() {
                    @Override
                    public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
                        // Descending by count; Long.compare avoids the overflow risk of int subtraction
                        return Long.compare(o2.f1, o1.f1);
                    }
                });
                // Take the top two after sorting and build the output
                StringBuilder result = new StringBuilder();
                result.append("========================================\n");
                for (int i = 0; i < 2; i++) {
                    if (tuple2s.size() >= (i + 1)) {
                        Tuple2<String, Long> temp = tuple2s.get(i);
                        String info = "Page views No." + (i + 1) +
                                " url:" + temp.f0 +
                                " views:" + temp.f1 +
                                " window end time:" + new Timestamp(context.window().getEnd()) + "\n";

                        result.append(info);
                    }
                }
                result.append("========================================\n");
                out.collect(result.toString());
            }
        })
        // Print
        .print();

        executionEnvironment.execute();
    }
}

Result:

========================================
Page views No.1 url:./cart views:1 window end time:2022-08-30 15:31:50.0
Page views No.2 url:./fav views:1 window end time:2022-08-30 15:31:50.0
========================================

========================================
Page views No.1 url:./prod?id=2 views:2 window end time:2022-08-30 15:31:55.0
Page views No.2 url:./prod?id=1 views:2 window end time:2022-08-30 15:31:55.0
========================================

========================================
Page views No.1 url:./prod?id=2 views:3 window end time:2022-08-30 15:32:00.0
Page views No.2 url:./prod?id=1 views:3 window end time:2022-08-30 15:32:00.0
========================================

========================================
Page views No.1 url:./fav views:3 window end time:2022-08-30 15:32:05.0
    ...

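2. Based on ProcessWindowFunction + KeyedProcessFunction

This approach is more involved. After keying by url, the data of one window is spread across different subtasks. Each group first produces a simple per-window result (url, visit count, window start time, window end time); these results are then keyed again by window end time, and a KeyedProcessFunction combines them.

1) Key by url and aggregate into UrlViewCount objects (url, visit count, window start time, window end time).

2) Key the UrlViewCount records by window end time, then use a KeyedProcessFunction to compute the Top N.

Code: note the APIs used inside TopN for buffering elements in state and for computing when the timer fires.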

package cn.qz.process;

import cn.qz.time.MyEvent;
import cn.qz.window.UrlViewCount;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;

public class ProcessWindowFunctionTopN {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);

        SingleOutputStreamOperator<MyEvent> eventSingleOutputStreamOperator = executionEnvironment.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                            @Override
                            public long extractTimestamp(MyEvent element, long recordTimestamp) {
                                return element.getTimestamp();
                            }
                        })
                );

        // Key by url and count the visits of each url
        SingleOutputStreamOperator<UrlViewCount> aggregate = eventSingleOutputStreamOperator.keyBy(data -> data.url)
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                .aggregate(new UrlViewCountAgg(), new UrlViewCountResult());

        // Sort the counts that belong to the same window
        SingleOutputStreamOperator<String> result = aggregate.keyBy(data -> data.windowEnd)
                .process(new TopN(2));

        result.print("result");

        executionEnvironment.execute();
    }

    // Custom incremental aggregation: counts the events
    public static class UrlViewCountAgg implements AggregateFunction<MyEvent, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(MyEvent value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            // Only called for merging windows (e.g. session windows); sum the partial counts
            return a + b;
        }
    }

    // Custom full-window function that only wraps the result with window information
    public static class UrlViewCountResult extends ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow> {

        @Override
        public void process(String url, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
            // Wrap the output together with the window information
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            out.collect(new UrlViewCount(url, elements.iterator().next(), start, end));
        }
    }

    // Custom process function: sort and take the top n
    public static class TopN extends KeyedProcessFunction<Long, UrlViewCount, String> {
        // n is kept as a field
        private Integer n;
        // List state that buffers the counts of one window
        private ListState<UrlViewCount> urlViewCountListState;

        public TopN(Integer n) {
            this.n = n;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // Obtain the list state handle from the runtime context
            urlViewCountListState = getRuntimeContext().getListState(
                    new ListStateDescriptor<UrlViewCount>("url-view-count-list",
                            Types.POJO(UrlViewCount.class)));
        }

        @Override
        public void processElement(UrlViewCount value, Context ctx, Collector<String> out) throws Exception {
            // Add the count record to the list state
            urlViewCountListState.add(value);
            // Register a timer for window end + 1 ms; by then all counts of this window have arrived and can be sorted
            ctx.timerService().registerEventTimeTimer(ctx.getCurrentKey() + 1);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Copy the data out of the list state into an ArrayList so it can be sorted
            ArrayList<UrlViewCount> urlViewCountArrayList = new ArrayList<>();
            for (UrlViewCount urlViewCount : urlViewCountListState.get()) {
                urlViewCountArrayList.add(urlViewCount);
            }
            // Clear the state to free resources
            urlViewCountListState.clear();

            // Sort by count in descending order
            urlViewCountArrayList.sort(new Comparator<UrlViewCount>() {
                @Override
                public int compare(UrlViewCount o1, UrlViewCount o2) {
                    // Long.compare avoids the overflow risk of int subtraction
                    return Long.compare(o2.count, o1.count);
                }
            });

            // Take the top n and build the output
            StringBuilder result = new StringBuilder();
            result.append("========================================\n");
            result.append("Window end time: " + new Timestamp(timestamp - 1) + "\n");
            // Stop early if the window holds fewer than n urls
            for (int i = 0; i < Math.min(this.n, urlViewCountArrayList.size()); i++) {
                UrlViewCount urlViewCount = urlViewCountArrayList.get(i);
                String info = "No." + (i + 1) + " "
                        + "url:" + urlViewCount.url + " "
                        + "views:" + urlViewCount.count + "\n";
                result.append(info);
            }
            result.append("========================================\n");
            out.collect(result.toString());
        }
    }
}

UrlViewCount:

package cn.qz.window;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class UrlViewCount {

    public String url;

    public Long count;

    public Long windowStart;

    public Long windowEnd;

}

Result:

result> ========================================
Window end time: 2022-08-30 16:02:50.0
No.1 url:./fav views:3
No.2 url:./prod?id=1 views:1
========================================

result> ========================================
Window end time: 2022-08-30 16:02:55.0
No.1 url:./fav views:5
No.2 url:./prod?id=1 views:2
========================================

result> ========================================
Window end time: 2022-08-30 16:03:00.0
No.1 url:./fav views:3
No.2 url:./prod?id=2 views:3
========================================
    
...

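4. Side Outputs

Process functions have one more distinctive feature: they can emit custom data to a "side output". A side output can be thought of as a branch forking off the main stream.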

package cn.qz.process;

import cn.qz.time.MyEvent;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

public class Process1 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the test data
        DataStreamSource<MyEvent> dataStreamSource = executionEnvironment.fromElements(
                new MyEvent("zs", "/user", 1000L),
                new MyEvent("zs", "/order", 1500L),
                new MyEvent("zs", "/product?id=1", 2000L),
                new MyEvent("zs", "/product?id=2", 2300L),
                new MyEvent("zs", "/product?id=3", 1800L),

                new MyEvent("ls", "/user", 1000L),
                new MyEvent("ls", "/order", 1500L),
                new MyEvent("ls", "/product?id=1", 2000L),
                new MyEvent("ls", "/product?id=2", 2300L),
                new MyEvent("ls", "/product?id=3", 1800L),

                new MyEvent("tq", "/product?id=3", 1800L)
        );

        // Define the side-output tag (an anonymous subclass, so the generic type information is kept)
        OutputTag<String> outputTag = new OutputTag<String>("late") {
        };

        // Out-of-order stream; the maximum out-of-orderness is set to zero
        SingleOutputStreamOperator<String> process = dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<MyEvent>() {
                    @Override
                    public long extractTimestamp(MyEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }))
                // No keyBy or window is needed: a ProcessFunction can be applied to a plain DataStream
                .process(new ProcessFunction<MyEvent, String>() {

                    @Override
                    public void processElement(MyEvent value, Context ctx, Collector<String> out) throws Exception {
                        if ("zs".equals(value.user)) {
                            out.collect(value.getUser());
                        } else {
                            ctx.output(outputTag, value.getUser());
                        }
                    }
                });

        process.print();

        process.getSideOutput(outputTag).print("late");

        executionEnvironment.execute();
    }
}

Result:

late:7> ls
2> zs
late:6> ls
1> zs
5> zs
3> zs
4> zs
late:8> ls
late:3> tq
late:1> ls
late:2> ls
posted @ 2022-08-30 22:48  QiaoZhi