Flink Pattern Detection (CEP) with Examples

CEP can be thought of as something like regular expressions, applied to event streams.

Flink provides a Complex Event Processing (CEP) library that allows pattern detection in event streams. In addition, Flink's SQL API provides a relational way to express queries, with a large number of built-in functions and rule-based optimizations that can be used out of the box.

Using CEP in IDEA

Import the dependency:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep_2.11</artifactId>
  <version>1.11.2</version>
</dependency>

SQL Semantics

Every MATCH_RECOGNIZE query consists of the following clauses:

  • PARTITION BY - defines the logical partitioning of the table; similar to a GROUP BY operation.
  • ORDER BY - specifies how incoming rows are ordered; this is mandatory, since patterns depend on order.
  • MEASURES - defines the output of the clause; similar to a SELECT clause.
  • ONE ROW PER MATCH - output mode; defines how many rows each match should produce.
  • AFTER MATCH SKIP - specifies where the next match should start; this is also a way to control how many distinct matches a single event can belong to.
  • PATTERN - allows constructing the pattern to search for, using a regular-expression-like syntax.
  • DEFINE - defines the conditions that the pattern variables must satisfy.

Note: Currently, the MATCH_RECOGNIZE clause can only be applied to an append-only table, and it always produces an append-only table as well.
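
Putting these clauses together, a full query looks roughly like the following (a schematic sketch only; MyTable, userid, proctime and the name conditions are placeholders, not part of the examples below):

SELECT T.aid, T.bid, T.cid
FROM MyTable
    MATCH_RECOGNIZE (
      PARTITION BY userid
      ORDER BY proctime
      MEASURES
        A.id AS aid,
        B.id AS bid,
        C.id AS cid
      PATTERN (A B C)
      DEFINE
        A AS name = 'a',
        B AS name = 'b',
        C AS name = 'c'
    ) AS T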

Greedy and Reluctant Quantifiers

Each quantifier can be either greedy (the default behavior) or reluctant. Greedy quantifiers try to match as many rows as possible, while reluctant quantifiers try to match as few rows as possible.

To illustrate the difference, consider the following query, where a greedy quantifier is applied to the variable B:

SELECT *
FROM Ticker
    MATCH_RECOGNIZE(
        PARTITION BY symbol
        ORDER BY rowtime
        MEASURES
            C.price AS lastPrice
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A B* C)
        DEFINE
            A AS A.price > 10,
            B AS B.price < 15,
            C AS C.price > 12
    )

Assume we have the following input:

 symbol  tax   price          rowtime
======= ===== ======== =====================
 XYZ     1     10       2018-09-17 10:00:02
 XYZ     2     11       2018-09-17 10:00:03
 XYZ     1     12       2018-09-17 10:00:04
 XYZ     2     13       2018-09-17 10:00:05
 XYZ     1     14       2018-09-17 10:00:06
 XYZ     2     16       2018-09-17 10:00:07

The pattern above produces the following output:

 symbol   lastPrice
======== ===========
 XYZ      16

The same query with B* changed to B*?, which makes B* reluctant, produces:

 symbol   lastPrice
======== ===========
 XYZ      13
 XYZ      16

Here the pattern variable B only matches the row with price 12, instead of swallowing the rows with prices 12, 13, and 14.

Note: The last variable of a pattern must not use a greedy quantifier, so a pattern like (A B*) is not allowed. This can easily be worked around by introducing an artificial state (e.g. C) whose condition is the negation of B's. You can then use a query like the following:

PATTERN (A B* C)
DEFINE
    A AS condA(),
    B AS condB(),
    C AS NOT condB()

Note: Optional reluctant quantifiers (A?? or A{0,1}?) are not supported at the moment.

To use the MATCH_RECOGNIZE clause in the SQL Client, no extra setup is required: all dependencies are included by default.

Test data (user, money):
1001,123
1001,23
1001,1
1001,1000
1001,300
1002,2
1002,0.5
1002,200
1002,0.5
1002,6000


CREATE TABLE `event` (
 `user` String,
 `money` Double,
 proc_time AS PROCTIME()
) WITH (
 'connector' = 'kafka',
 'topic' = 'event',
 'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
 'properties.group.id' = 'asdasdasd',
 'format' = 'csv',
 'scan.startup.mode' = 'latest-offset'
);


For a given account, if a transaction of less than $1 is immediately followed by a transaction of more than $500, emit an alert.

SELECT *
FROM `event`
    MATCH_RECOGNIZE ( -- pattern detection: match rows in the stream against a rule
      PARTITION BY `user` -- partitioning key
      ORDER BY proc_time -- row order; must be an event-time or processing-time attribute
      MEASURES -- the values to return once a match succeeds
        A.money AS amoney,
        B.money AS bmoney
      PATTERN (A B) WITHIN INTERVAL '5' SECOND -- the pattern, plus a time constraint (from the official docs: the match must complete within 5 seconds)
      DEFINE -- the concrete matching conditions
        A AS `money` < 1,
        B AS `money` > 500
    ) AS T
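
With the test data above, and assuming each account's rows arrive within five seconds of one another (the query orders by processing time), only account 1002 should produce a match: the $0.5 transaction immediately followed by the $6000 one.

 user   amoney   bmoney
====== ======== ========
 1002   0.5      6000.0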


For a given account, if a run of consecutive transactions of less than $1 is immediately followed by a transaction of more than $500, emit an alert.


SELECT *
FROM `event`
    MATCH_RECOGNIZE ( 
      PARTITION BY `user` 
      ORDER BY proc_time 
      MEASURES 
        FIRST(A.money) AS fa, -- FIRST(): the earliest row mapped to A
        LAST(A.money) AS la, -- LAST(): the most recent row mapped to A (the default)
        AVG(A.money) AS avgMoney,
        B.money AS bmoney
      AFTER MATCH SKIP PAST LAST ROW -- resume matching at the row after the last row of the current match; without this, this example emits three matches
      PATTERN (A+ B) WITHIN INTERVAL '5' SECOND -- A+: A occurs one or more times, as in a regular expression
      DEFINE 
        A AS `money` < 1,
        B AS `money` > 500
    ) AS T;


1002,0.5
1002,0.6
1002,0.7
1002,6000
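
Assuming all four rows arrive within the five-second window, the greedy A+ absorbs the three sub-$1 transactions and a single match should be emitted:

 user   fa    la    avgMoney   bmoney
====== ===== ===== ========== ========
 1002   0.5   0.7   0.6        6000.0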


For a given account, if the average of a run of consecutive transactions is less than $1 and the run is immediately followed by a transaction of more than $500, emit an alert.

SELECT *
FROM `event`
    MATCH_RECOGNIZE ( 
      PARTITION BY `user` 
      ORDER BY proc_time 
      MEASURES 
        FIRST(A.money) AS fa,
        LAST(A.money) AS la,
        AVG(A.money) AS avgMoney,
        B.money AS bmoney
      AFTER MATCH SKIP PAST LAST ROW
      PATTERN (A+ B) WITHIN INTERVAL '5' SECOND 
      DEFINE 
        A AS AVG(A.money) < 1,
        B AS B.money > 500
    ) AS T;

1002,0.5
1002,0.6
1002,1.2
1002,6000
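
Here the third transaction (1.2) is itself above $1, but the running average of A over 0.5, 0.6, and 1.2 is roughly 0.77, still below 1, so A+ keeps absorbing rows and the alert should still fire:

 user   fa    la    avgMoney   bmoney
====== ===== ===== ========== ========
 1002   0.5   1.2   ~0.77      6000.0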

The official documentation has plenty of examples; when you need something, look it up there instead of memorizing the code.

PATTERN -- Note: patterns that can produce an empty match are not supported. Examples of such patterns are PATTERN (A*), PATTERN (A? B*), PATTERN (A{0,} B{0,} C*), and so on.

Pattern Matching: A Case Study

LAST(B.price, 1) -- the row before the most recent B row; returns NULL if there is none

For example, once B has matched rows with prices 18, 15, and 14, LAST(B.price) is 14 and LAST(B.price, 1) is 15; while B has matched only one row, LAST(B.price, 1) is NULL.

ACME,2022-03-26 10:00:00,12,1
ACME,2022-03-26 10:00:01,17,2
ACME,2022-03-26 10:00:02,19,1
ACME,2022-03-26 10:00:03,21,3
ACME,2022-03-26 10:00:04,25,2
ACME,2022-03-26 10:00:05,18,1
ACME,2022-03-26 10:00:06,15,1
ACME,2022-03-26 10:00:07,14,2
ACME,2022-03-26 10:00:08,24,2
ACME,2022-03-26 10:00:09,25,2
ACME,2022-03-26 10:00:10,19,1

symbol: String                           # the stock symbol
rowtime: TIMESTAMP(3)                    # the point in time when the values changed
price: BIGINT                            # the stock price
tax: BIGINT                              # the tax payable on the stock


Monitor stock prices: read the data from Kafka, detect in real time the intervals in which a stock's price keeps falling, and save the results to MySQL.

1. Create the Kafka source table (the WATERMARK clause declares rowtime as an event-time attribute and tolerates up to 5 seconds of out-of-order data)

CREATE TABLE t_symbol (
 symbol String,
 rowtime TIMESTAMP(3),
 price BIGINT,
 tax BIGINT,
 WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
) WITH (
 'connector' = 'kafka',
 'topic' = 'symbol',
 'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
 'properties.group.id' = 'asdasdasd',
 'format' = 'csv',
 'scan.startup.mode' = 'latest-offset'
);

Create the MySQL sink table -- the corresponding table must already exist in MySQL; a sketch of its DDL follows the Flink definition below.
CREATE TABLE t_symbol_mysql (
 symbol String,
 start_time TIMESTAMP(3),
 last_btime TIMESTAMP(3),
 end_time TIMESTAMP(3),
 start_price BIGINT,
 last_bprice BIGINT,
 end_price BIGINT,
 PRIMARY KEY (symbol,start_time) NOT ENFORCED
) WITH (
   'connector' = 'jdbc',
   'url' = 'jdbc:mysql://master:3306/bigdata?useUnicode=true&characterEncoding=utf-8&useSSL=false',
   'table-name' = 'symbol',
   'username' = 'root',
   'password'= '123456'
)
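
A minimal sketch of the MySQL-side table the sink writes into (assuming column names mirror the Flink DDL above; DATETIME(3) is used here instead of TIMESTAMP to avoid MySQL's TIMESTAMP range limits -- adjust as needed):

CREATE TABLE symbol (
  symbol      VARCHAR(64) NOT NULL,  -- stock symbol, part of the upsert key
  start_time  DATETIME(3) NOT NULL,  -- start of the falling interval, part of the upsert key
  last_btime  DATETIME(3),
  end_time    DATETIME(3),
  start_price BIGINT,
  last_bprice BIGINT,
  end_price   BIGINT,
  PRIMARY KEY (symbol, start_time)   -- must match the Flink sink's PRIMARY KEY
);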


2. Write the SQL

insert into t_symbol_mysql
SELECT T.symbol,T.start_time,T.last_btime,T.end_time,T.start_price,T.last_bprice,T.end_price
FROM t_symbol
    MATCH_RECOGNIZE ( 
      PARTITION BY symbol
      ORDER BY rowtime 
      MEASURES 
        A.rowtime AS start_time,
        LAST(B.rowtime) AS last_btime,
        C.rowtime AS end_time,
        A.price AS start_price,
        LAST(B.price) AS last_bprice,
        C.price AS end_price
      AFTER MATCH SKIP PAST LAST ROW
      PATTERN (A B+ C) -- A: the starting row (no condition, so any row), B: the falling middle, C: the row where the price rises again
      DEFINE 
        B AS (LAST(B.price, 1) IS NULL AND B.price < A.price) OR B.price < LAST(B.price, 1),
        C AS C.price > LAST(B.price)
    ) AS T;

ACME,2022-03-26 10:00:00,12,1
ACME,2022-03-26 10:00:01,17,2
ACME,2022-03-26 10:00:02,19,1
ACME,2022-03-26 10:00:03,21,3
ACME,2022-03-26 10:00:04,25,2
ACME,2022-03-26 10:00:05,18,1
ACME,2022-03-26 10:00:06,15,1
ACME,2022-03-26 10:00:07,14,2
ACME,2022-03-26 10:00:08,24,2
ACME,2022-03-26 10:00:09,25,2
ACME,2022-03-26 10:00:15,25,2
ACME,2022-03-26 10:00:10,19,1
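
For the in-order rows (10:00:00 through 10:00:09) there is exactly one falling interval: the price peaks at 25 (10:00:04), falls through 18, 15, 14, and rises again to 24 (10:00:08), so a single row should land in MySQL:

 symbol   start_time            last_btime            end_time              start_price   last_bprice   end_price
======== ===================== ===================== ===================== ============= ============= ===========
 ACME     2022-03-26 10:00:04   2022-03-26 10:00:07   2022-03-26 10:00:08   25            14            24

The last two rows arrive out of event-time order (the 10:00:10 event shows up after the 10:00:15 one); the 5-second watermark delay is what gives MATCH_RECOGNIZE a chance to sort such events back into rowtime order before matching.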

IDEA Implementation

package com.shujia.flink.table

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object Demo13Symbol {
  def main(args: Array[String]): Unit = {
    val bsEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    bsEnv.setParallelism(1)

    // configure the table environment settings
    val bsSettings: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // use the Blink planner
      .inStreamingMode() // streaming mode
      .build()

    // create the Flink table environment
    val bsTableEnv: StreamTableEnvironment = StreamTableEnvironment.create(bsEnv, bsSettings)

    bsTableEnv.executeSql(
      """
        |CREATE TABLE t_symbol (
        | symbol String,
        | rowtime TIMESTAMP(3),
        | price BIGINT,
        | tax BIGINT,
        | WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
        |) WITH (
        | 'connector' = 'kafka',
        | 'topic' = 'symbol',
        | 'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
        | 'properties.group.id' = 'asdasdasd',
        | 'format' = 'csv',
        | 'scan.startup.mode' = 'latest-offset'
        |)
        |
      """.stripMargin)

    bsTableEnv.executeSql(
      """
        |
        |CREATE TABLE t_symbol_mysql (
        | symbol String,
        | start_time TIMESTAMP(3),
        | last_btime TIMESTAMP(3),
        | end_time TIMESTAMP(3),
        | start_price BIGINT,
        | last_bprice BIGINT,
        | end_price BIGINT,
        | PRIMARY KEY (symbol,start_time) NOT ENFORCED
        |) WITH (
        |   'connector' = 'jdbc',
        |   'url' = 'jdbc:mysql://master:3306/bigdata?useUnicode=true&characterEncoding=utf-8&useSSL=false',
        |   'table-name' = 'symbol',
        |   'username' = 'root',
        |   'password'= '123456'
        |)
        |
      """.stripMargin)

    bsTableEnv.executeSql(
      """
        |
        |
        |insert into t_symbol_mysql
        |SELECT T.symbol,T.start_time,T.last_btime,T.end_time,T.start_price,T.last_bprice,T.end_price
        |FROM t_symbol
        |    MATCH_RECOGNIZE (
        |      PARTITION BY symbol
        |      ORDER BY rowtime
        |      MEASURES
        |        A.rowtime as start_time,
        |        LAST(B.rowtime) as last_btime,
        |        C.rowtime as end_time,
        |        A.price as start_price,
        |        LAST(B.price) as last_bprice,
        |        C.price as end_price
        |      AFTER MATCH SKIP PAST LAST ROW
        |      PATTERN (A B+ C)
        |      DEFINE
        |        B AS (LAST(B.price,1) IS NULL AND B.price < A.price) OR B.price < LAST(B.price,1) ,
        |        C AS C.price > LAST(B.price)
        |    ) AS T
        |
      """.stripMargin)

  }
}