An odd requirement I heard recently: in Flink SQL, every hour, compute the most recent 1-minute TPS at each 10-second mark of the past hour

This requirement is a bit convoluted, so here is an example.
Say the job fires at 11:00. The evaluation points and their windows are:
10:00:10 window: 09:59:10 to 10:00:10 (end exclusive)
10:00:20 window: 09:59:20 to 10:00:20 (end exclusive)
10:00:30 window: 09:59:30 to 10:00:30 (end exclusive)
...
10:59:50 window: 10:58:50 to 10:59:50 (end exclusive)
11:00:00 window: 10:59:00 to 11:00:00 (end exclusive)

That is 3600 / 10 = 360 evaluation points per hourly run, each looking back exactly 60 seconds.

Clearly, a requirement this odd cannot be computed by Flink SQL directly. Decompose it a little, though, and "compute the most recent 1-minute TPS every 10 seconds" is well within reach.

Computing the most recent 1-minute TPS every 10 seconds

Take the first decomposition: compute the most recent 1-minute TPS once every 10 seconds. Flink SQL expresses this directly with a hopping (sliding) window, and the query is simple to write.

The SQL:

-- kafka source
drop table if exists user_log;
CREATE TABLE user_log
(
    `event_time`   TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL, -- from Kafka connector
    `partition_id` BIGINT METADATA FROM 'partition' VIRTUAL,       -- from Kafka connector
    `offset`       BIGINT METADATA VIRTUAL,                        -- from Kafka connector
    user_id        VARCHAR,
    item_id        VARCHAR,
    category_id    VARCHAR,
    behavior       VARCHAR,
    ts             TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'earliest-offset'
      ,'format' = 'json'
      );
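
For local testing without a Kafka cluster, a generator source can stand in for user_log. A minimal sketch, assuming Flink's built-in datagen connector (user_log_gen is a made-up name; substitute it for user_log in the queries below, and rows-per-second pins the true input rate, so the reported TPS is easy to verify):

-- hypothetical stand-in source: emits 10 rows per second, so the
-- 1-minute TPS reported downstream should converge to 10
CREATE TABLE user_log_gen
(
    user_id     STRING,
    item_id     STRING,
    category_id STRING,
    behavior    STRING,
    ts          AS localtimestamp,
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
      'connector' = 'datagen'
      ,'rows-per-second' = '10'
      ,'fields.user_id.length' = '10'
      );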


-- kafka sink
drop table if exists user_log_sink;
CREATE TABLE user_log_sink
(
    window_start STRING,
    window_end   STRING,
    tps          DECIMAL(26, 6)
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log_tps'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'latest-offset'
      ,'format' = 'json'
      );


-- every 10 seconds, compute the TPS over the most recent 1 minute
insert into user_log_sink
select date_format(window_start, 'yyyy-MM-dd HH:mm:ss')
     , date_format(window_end, 'yyyy-MM-dd HH:mm:ss')
     , count(user_id) / 60.0 tps
FROM TABLE(
        HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' SECOND, INTERVAL '60' SECOND))
group by window_start, window_end
;

The results:

{"window_start":"2023-02-10 13:29:50","window_end":"2023-02-10 13:30:50","tps":500.316667}
{"window_start":"2023-02-10 13:30:00","window_end":"2023-02-10 13:31:00","tps":503.333333}
{"window_start":"2023-02-10 13:30:10","window_end":"2023-02-10 13:31:10","tps":503.333333}
{"window_start":"2023-02-10 13:30:20","window_end":"2023-02-10 13:31:20","tps":503.333333}
{"window_start":"2023-02-10 13:30:30","window_end":"2023-02-10 13:31:30","tps":503.333333}
{"window_start":"2023-02-10 13:30:40","window_end":"2023-02-10 13:31:40","tps":503.333333}
{"window_start":"2023-02-10 13:30:50","window_end":"2023-02-10 13:31:50","tps":503.016667}

Computing the most recent 1-minute TPS on the hour

Decompose the requirement a bit further: compute the most recent 1-minute TPS once per hour. Flink SQL handles this with a tumbling window, which is also straightforward.

  • Note: to make testing easier, the version below fires every 10 minutes instead of every hour.
drop table if exists user_log_sink_1;
CREATE TABLE user_log_sink_1
(
    window_start STRING,
    window_end   STRING,
    ts_min       STRING,
    ts_max       STRING,
    tps          DECIMAL(26, 6)
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log_tps'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'latest-offset'
      ,'format' = 'json'
      );

-- Flink SQL has no DATE_ADD function (its built-in date functions are rather sparse), so as a workaround use TIMESTAMPDIFF to test whether ts falls in the same minute as the window end, i.e. keep only the most recent 1 minute of data
-- to make testing easier, fire every 10 minutes
insert into user_log_sink_1
select window_start, window_end, ts_min, ts_max, tps
from (
        select date_format(window_start, 'yyyy-MM-dd HH:mm:ss')  window_start
             , date_format(window_end, 'yyyy-MM-dd HH:mm:ss')    window_end
             , count(user_id) / 60.0                             tps
             , date_format(min(ts), 'yyyy-MM-dd HH:mm:ss')       ts_min
             , date_format(max(ts), 'yyyy-MM-dd HH:mm:ss')       ts_max
        FROM TABLE(
                TUMBLE(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE))
        where TIMESTAMPDIFF(MINUTE, ts, window_end) <= 0
        group by window_start, window_end
     )
;

The results:

{"window_start":"2023-02-10 15:50:00","window_end":"2023-02-10 16:00:00","ts_min":"2023-02-10 15:59:00","ts_max":"2023-02-10 15:59:59","tps":1E+1}
{"window_start":"2023-02-10 16:00:00","window_end":"2023-02-10 16:10:00","ts_min":"2023-02-10 16:09:00","ts_max":"2023-02-10 16:09:59","tps":1E+1}
{"window_start":"2023-02-10 16:10:00","window_end":"2023-02-10 16:20:00","ts_min":"2023-02-10 16:19:00","ts_max":"2023-02-10 16:19:59","tps":10.816667}

[Failed] Computing, on the hour, the most recent 1-minute TPS for every 10 seconds of the past hour

Without decomposition, though, it gets hard. I took a rough pass at the full requirement in Flink SQL and could not make it work, mainly because Flink SQL does not support the non-equi join this needs; otherwise a self-join would get the SQL written.

A window self-join like the one below is recognized as a non-equi join and fails. (Even with the conditions and TIMESTAMPDIFF(second, t1.window_time, t2.ts) <= 0 and TIMESTAMPDIFF(second, t1.window_time, t2.ts) >= -60 removed, it reports the same error; otherwise a Cartesian product plus filtering might have worked.)

-- t1: one row per 10-second mark (ts floored to a 10-second boundary)
-- t2: event counts per (window, ts); the join sums the 60 seconds before each mark
insert into user_log_sink_2
select t1.window_start_str
     , t1.window_end_str
     , t1.window_time
     , sum(t2.coun) / 60.0 tps
from (
        select date_format(window_start, 'yyyy-MM-dd HH:mm:ss')                                    window_start_str
             , date_format(window_end, 'yyyy-MM-dd HH:mm:ss')                                      window_end_str
             , window_end
             , TO_TIMESTAMP(concat(substring(date_format(ts, 'yyyy-MM-dd HH:mm:ss'), 1, 18), '0')) window_time
        FROM TABLE(
                HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE, INTERVAL '10' MINUTE))
        group by window_start, window_end, substring(date_format(ts, 'yyyy-MM-dd HH:mm:ss'), 1, 18)
     ) t1
left join (
        select window_start, window_end, ts, count(1) coun
        FROM TABLE(
                HOP(TABLE user_log, DESCRIPTOR(ts), INTERVAL '10' MINUTE, INTERVAL '10' MINUTE))
        group by window_start, window_end, ts
     ) t2
  on t1.window_end = t2.window_end
 and TIMESTAMPDIFF(second, t1.window_time, t2.ts) <= 0
 and TIMESTAMPDIFF(second, t1.window_time, t2.ts) >= -60
group by t1.window_start_str
       , t1.window_end_str
       , t1.window_time
;

The error:

org.apache.flink.table.api.SqlParserException: SQL parse failed. Non-query expression encountered in illegal context
 at org.apache.flink.table.planner.parse.CalciteParser.parseSqlList(CalciteParser.java:82)
at org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:102)
at org.apache.flink.table.api.internal.StatementSetImpl.addInsertSql(StatementSetImpl.java:62)
at org.apache.flink.table.api.bridge.scala.internal.StreamStatementSetImpl.addInsertSql(StreamStatementSetImpl.scala:36)
at com.rookie.submit.main.SqlSubmit$.$anonfun$main$1(SqlSubmit.scala:96)
at com.rookie.submit.main.SqlSubmit$.$anonfun$main$1$adapted(SqlSubmit.scala:82)
at scala.collection.Iterator.foreach(Iterator.scala:937)
at scala.collection.Iterator.foreach$(Iterator.scala:937)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
at scala.collection.IterableLike.foreach(IterableLike.scala:70)
at scala.collection.IterableLike.foreach$(IterableLike.scala:69)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.rookie.submit.main.SqlSubmit$.main(SqlSubmit.scala:82)
at com.rookie.submit.main.SqlSubmit.main(SqlSubmit.scala)
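
As an aside, Flink SQL does support interval joins when both inputs still expose time attributes and the predicate bounds the time difference in both directions. A minimal sketch on the raw streams (user_log_2 is a hypothetical second table with the same schema as user_log):

-- a supported interval join: both ts columns are event-time attributes
SELECT l.user_id, l.ts, r.ts
FROM user_log l, user_log_2 r
WHERE l.user_id = r.user_id
  AND l.ts BETWEEN r.ts - INTERVAL '1' MINUTE AND r.ts;

That path is closed here, though: t1.window_time above is built with TO_TIMESTAMP over a string, which yields a plain timestamp rather than a time attribute, so the self-join cannot be planned as an interval join.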

Next up

Implementing the requirement in code: every hour, compute the most recent 1-minute TPS at each 10-second mark of the past hour.

Follow the Flink菜鸟 WeChat official account for occasional posts on Flink development.

posted on 2023-02-13 16:26 by Flink菜鸟