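Flink Pattern Detection (CEP) with Examples
Flink Pattern Detection (CEP)
CEP can be thought of as something like regular expressions, applied to a stream of events instead of a string of characters.
Flink provides a Complex Event Processing (CEP) library that allows pattern detection in event streams. In addition, Flink's SQL API offers a relational way of expressing queries, with a large set of built-in functions and rule-based optimizations that can be used out of the box. The MATCH_RECOGNIZE clause brings the two together.
Using CEP in IDEA
Add the Maven dependency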
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-cep_2.11</artifactId>
<version>1.11.2</version>
</dependency>
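SQL semantics
Every MATCH_RECOGNIZE query consists of the following clauses:
- PARTITION BY - defines the logical partitioning of the table; similar to a GROUP BY operation.
- ORDER BY - specifies how the incoming rows are ordered; this is required because patterns depend on order.
- MEASURES - defines the output of the clause; similar to a SELECT clause.
- ONE ROW PER MATCH - the output mode, which defines how many rows each match produces.
- AFTER MATCH SKIP - specifies where the next match starts; this also controls how many distinct matches a single event can belong to.
- PATTERN - allows building the pattern to search for with a syntax similar to regular expressions.
- DEFINE - defines the conditions that the pattern variables must satisfy.
Note: Currently, the MATCH_RECOGNIZE clause can only be applied to an append table. It also always produces an append table.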
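Greedy and reluctant quantifiers
Each quantifier can be either greedy (the default) or reluctant. A greedy quantifier tries to match as many rows as possible, while a reluctant quantifier tries to match as few as possible.
To illustrate the difference, consider the following query, in which a greedy quantifier is applied to the variable B: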
SELECT *
FROM Ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY rowtime
MEASURES
C.price AS lastPrice
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B* C)
DEFINE
A AS A.price > 10,
B AS B.price < 15,
C AS C.price > 12
)
Assume we have the following input:
symbol  tax   price    rowtime
======= ===== ======== =====================
XYZ     1     10       2018-09-17 10:00:02
XYZ     2     11       2018-09-17 10:00:03
XYZ     1     12       2018-09-17 10:00:04
XYZ     2     13       2018-09-17 10:00:05
XYZ     1     14       2018-09-17 10:00:06
XYZ     2     16       2018-09-17 10:00:07
The pattern above produces the following output:
symbol   lastPrice
======== ===========
XYZ      16
The same query with B* changed to B*?, which makes B* reluctant, produces:
symbol   lastPrice
======== ===========
XYZ      13
XYZ      16
In the first of these two matches, the pattern variable B matches only the row with price 12 instead of swallowing the rows with prices 12, 13, and 14.
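The difference between the two outputs can be reproduced outside Flink. The following is a minimal Python sketch, not Flink's implementation: it assumes the six rows arrive as a bounded list already sorted by rowtime, and the helper names (match_from, last_prices) are made up for this illustration. It applies the DEFINE conditions of the query to the price column and tries either the longest or the shortest run of B rows first.

```python
from typing import List, Optional

PRICES = [10, 11, 12, 13, 14, 16]  # price column of the input table above


def is_a(price: int) -> bool:  # A AS A.price > 10
    return price > 10


def is_b(price: int) -> bool:  # B AS B.price < 15
    return price < 15


def is_c(price: int) -> bool:  # C AS C.price > 12
    return price > 12


def match_from(prices: List[int], start: int, greedy: bool) -> Optional[int]:
    """Try to match A B* C starting at index start; return the index of C or None."""
    if not is_a(prices[start]):
        return None
    # Count how many consecutive rows after A could be mapped to B.
    max_b = 0
    while start + 1 + max_b < len(prices) and is_b(prices[start + 1 + max_b]):
        max_b += 1
    # Greedy tries the longest run of B first, reluctant the shortest.
    counts = range(max_b, -1, -1) if greedy else range(0, max_b + 1)
    for n in counts:
        c_idx = start + 1 + n
        if c_idx < len(prices) and is_c(prices[c_idx]):
            return c_idx
    return None


def last_prices(prices: List[int], greedy: bool) -> List[int]:
    """Collect C.price (lastPrice) for every match."""
    out, i = [], 0
    while i < len(prices):
        c_idx = match_from(prices, i, greedy)
        if c_idx is None:
            i += 1
        else:
            out.append(prices[c_idx])
            i = c_idx + 1  # AFTER MATCH SKIP PAST LAST ROW
    return out


print(last_prices(PRICES, greedy=True))   # [16]
print(last_prices(PRICES, greedy=False))  # [13, 16]
```

In the reluctant run, the first match ends as soon as a row satisfies C (price 13), which leaves the rows with prices 14 and 16 free to form a second match.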
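Note: The last variable of a pattern cannot use a greedy quantifier, so a pattern like (A B*) is not allowed. This is easy to work around by introducing an artificial state (for example C) whose condition is the negation of B's condition. You can then use a query like the following: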
PATTERN (A B* C)
DEFINE
A AS condA(),
B AS condB(),
C AS NOT condB()
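Note: Optional reluctant quantifiers (A?? or A{0,1}?) are currently not supported.
Flink SQL shell example
To use the MATCH_RECOGNIZE clause in the SQL Client, you do not need to do anything, because all dependencies are included by default.
Sample data (user, money):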
1001,123
1001,23
1001,1
1001,1000
1001,300
1002,2
1002,0.5
1002,200
1002,0.5
1002,6000
CREATE TABLE `event` (
`user` String,
`money` Double,
proc_time AS PROCTIME()
) WITH (
'connector' = 'kafka',
'topic' = 'event',
'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
'properties.group.id' = 'asdasdasd',
'format' = 'csv',
'scan.startup.mode' = 'latest-offset'
);
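For each account, raise an alert when a transaction of less than $1 is immediately followed by a transaction of more than $500.
SELECT *
FROM `event`
MATCH_RECOGNIZE ( -- pattern detection: match rows in the stream against a rule
PARTITION BY `user` -- partitioning column
ORDER BY proc_time -- row order; must be a time attribute (event time or processing time)
MEASURES -- which values to return once a match succeeds
A.money AS amoney,
B.money AS bmoney
PATTERN (A B) WITHIN INTERVAL '5' SECOND -- the pattern plus a time limit (WITHIN comes from the official docs: the whole match must complete within 5 seconds)
DEFINE -- the condition each pattern variable must satisfy
A AS `money` < 1,
B AS `money` > 500
) AS T;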
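To check which of the sample rows would trigger this rule, the pattern can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the rows are processed in the order listed, the 5-second WITHIN limit is ignored, and find_alerts is a hypothetical helper, not part of Flink.

```python
from collections import defaultdict

# (user, money) rows in arrival order, copied from the sample data above
ROWS = [
    ("1001", 123.0), ("1001", 23.0), ("1001", 1.0), ("1001", 1000.0), ("1001", 300.0),
    ("1002", 2.0), ("1002", 0.5), ("1002", 200.0), ("1002", 0.5), ("1002", 6000.0),
]


def find_alerts(rows):
    """Return (user, amoney, bmoney) for every small payment directly followed by a large one."""
    by_user = defaultdict(list)  # PARTITION BY user
    for user, money in rows:
        by_user[user].append(money)
    alerts = []
    for user, amounts in by_user.items():
        # PATTERN (A B): look at each pair of adjacent rows within one user
        for a, b in zip(amounts, amounts[1:]):
            if a < 1 and b > 500:  # DEFINE A AS money < 1, B AS money > 500
                alerts.append((user, a, b))
    return alerts


print(find_alerts(ROWS))  # [('1002', 0.5, 6000.0)]
```

User 1001 produces no alert even though 1 is followed by 1000, because the condition is strictly money < 1. For user 1002, the first 0.5 is followed by 200, so only the second 0.5 followed by 6000 matches.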
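For each account, raise an alert when one or more consecutive transactions of less than $1 are followed by a transaction of more than $500.
SELECT *
FROM `event`
MATCH_RECOGNIZE (
PARTITION BY `user`
ORDER BY proc_time
MEASURES
FIRST(A.money) AS fa, -- FIRST(): the first row mapped to A
LAST(A.money) AS la, -- LAST(): the most recent row mapped to A; this is also what a plain A.money returns
AVG(A.money) AS avgMoney,
B.money AS bmoney
AFTER MATCH SKIP PAST LAST ROW -- resume matching at the row after the last row of the current match; without this clause the sample rows after the query produce three results
PATTERN (A+ B) WITHIN INTERVAL '5' SECOND -- A+: A occurs one or more times, as in a regular expression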
DEFINE
A AS `money` < 1,
B AS `money` > 500
) AS T;
1002,0.5
1002,0.6
1002,0.7
1002,6000
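The comment on AFTER MATCH SKIP in the query notes that three results appear when the clause is left out. The Python sketch below shows why, assuming that the behavior without the clause corresponds to resuming at the next row after the start of each match (SKIP TO NEXT ROW). That assumption, the helper names, and the omission of the WITHIN limit are all simplifications for illustration.

```python
AMOUNTS = [0.5, 0.6, 0.7, 6000.0]  # the four rows for user 1002 listed above


def match_from(amounts, start):
    """Match A+ B at index start; return (first_a, last_a, b, end_index) or None."""
    end = start
    while end < len(amounts) and amounts[end] < 1:  # A+: one or more rows below 1
        end += 1
    if end == start or end >= len(amounts) or amounts[end] <= 500:
        return None  # no A row at all, or no B row above 500 right after the A rows
    a_rows = amounts[start:end]
    return (a_rows[0], a_rows[-1], amounts[end], end)


def run(amounts, skip_past_last_row):
    """Return (fa, la, bmoney) for every match under the chosen skip strategy."""
    results, i = [], 0
    while i < len(amounts):
        m = match_from(amounts, i)
        if m is None:
            i += 1
            continue
        results.append(m[:3])
        # SKIP PAST LAST ROW jumps behind B; the assumed alternative moves one row forward.
        i = m[3] + 1 if skip_past_last_row else i + 1
    return results


print(run(AMOUNTS, skip_past_last_row=True))
# [(0.5, 0.7, 6000.0)]
print(run(AMOUNTS, skip_past_last_row=False))
# [(0.5, 0.7, 6000.0), (0.6, 0.7, 6000.0), (0.7, 0.7, 6000.0)]
```

With SKIP PAST LAST ROW each row belongs to at most one match, so the three small payments and the large one yield a single alert.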
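For each account, raise an alert when the average of one or more consecutive transactions stays below 1 and they are followed by a transaction of more than $500.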
SELECT *
FROM `event`
MATCH_RECOGNIZE (
PARTITION BY `user`
ORDER BY proc_time
MEASURES
FIRST(A.money) as fa,
LAST(A.money) AS la,
AVG(A.money) as avgMoney,
B.money AS bmoney
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A+ B) WITHIN INTERVAL '5' SECOND
DEFINE
A AS AVG(A.money) < 1,
B AS B.money > 500
) AS T;
1002,0.5
1002,0.6
1002,1.2
1002,6000
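With these four rows, the 1.2 transaction is still mapped to A even though it is larger than 1, because the condition looks at the average of all rows mapped to A so far. A small Python sketch of that running check follows; rows_mapped_to_a is a hypothetical helper that only mirrors the DEFINE condition and is not how Flink evaluates it internally.

```python
AMOUNTS = [0.5, 0.6, 1.2, 6000.0]  # the four rows for user 1002 listed above


def rows_mapped_to_a(amounts):
    """Extend A while the average of the rows mapped to A, candidate included, stays below 1."""
    mapped = []
    for money in amounts:
        candidate = mapped + [money]
        if sum(candidate) / len(candidate) < 1:  # A AS AVG(A.money) < 1
            mapped = candidate
        else:
            break
    return mapped


a_rows = rows_mapped_to_a(AMOUNTS)
next_row = AMOUNTS[len(a_rows)]

print(a_rows)          # [0.5, 0.6, 1.2]
print(next_row > 500)  # True, so the next row satisfies B AS B.money > 500
```

The running averages are 0.5, 0.55, and roughly 0.77, all below 1. Adding 6000 would push the average far above 1, so that row ends the A run and is tested against B instead.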
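The official documentation contains many more examples; look them up there when you need them rather than memorizing the code.
PATTERN note: patterns that can potentially produce an empty match are not supported. Examples of such patterns are PATTERN (A*), PATTERN (A? B*), PATTERN (A{0,} B{0,} C*), etc.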
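Since the PATTERN syntax resembles regular expressions, an ordinary regex engine can serve as a rough analogy for what "can produce an empty match" means. The sketch below uses Python's re module, writes each pattern variable as a single letter, and lets the empty string stand for "no rows". It is an analogy only and says nothing about how Flink validates patterns.

```python
import re

# PATTERN clause on the left, an equivalent character-level regex on the right.
PATTERNS = {
    "A*": r"A*",
    "A? B*": r"A?B*",
    "A{0,} B{0,} C*": r"A{0,}B{0,}C*",
    "A B* C": r"AB*C",
    "A+ B": r"A+B",
}


def can_match_empty(regex: str) -> bool:
    """True if the regex accepts the empty string, i.e. a match with zero rows."""
    return re.fullmatch(regex, "") is not None


for pattern, regex in PATTERNS.items():
    print(f"PATTERN ({pattern}): empty match possible = {can_match_empty(regex)}")
```

The first three patterns accept the empty string because every variable in them is optional. The last two, which are the shapes used in the examples of this post, each require at least one row.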
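Pattern matching case study
LAST(B.price, 1) -- the price of the row before the most recent row mapped to B (inside B's DEFINE condition, this is the previous B row); returns NULL if there is no such row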
ACME,2022-03-26 10:00:00,12,1
ACME,2022-03-26 10:00:01,17,2
ACME,2022-03-26 10:00:02,19,1
ACME,2022-03-26 10:00:03,21,3
ACME,2022-03-26 10:00:04,25,2
ACME,2022-03-26 10:00:05,18,1
ACME,2022-03-26 10:00:06,15,1
ACME,2022-03-26 10:00:07,14,2
ACME,2022-03-26 10:00:08,24,2
ACME,2022-03-26 10:00:09,25,2
ACME,2022-03-26 10:00:10,19,1
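symbol: String # stock ticker symbol
rowtime: TIMESTAMP(3) # point in time at which these values changed
price: BIGINT # stock price
tax: BIGINT # tax liability of the stock
Monitor stock prices: read the data from Kafka, detect in real time the periods in which a stock's price keeps falling, and save the results to MySQL.
1. Create the Kafka source table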
CREATE TABLE t_symbol (
symbol String,
rowtime TIMESTAMP(3),
price BIGINT,
tax BIGINT,
WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'symbol',
'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
'properties.group.id' = 'asdasdasd',
'format' = 'csv',
'scan.startup.mode' = 'latest-offset'
);
Create the MySQL sink table -- the corresponding table (symbol) must already exist in MySQL
CREATE TABLE t_symbol_mysql (
symbol String,
start_time TIMESTAMP(3),
last_btime TIMESTAMP(3),
end_time TIMESTAMP(3),
start_price BIGINT,
last_bprice BIGINT,
end_price BIGINT,
PRIMARY KEY (symbol,start_time) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://master:3306/bigdata?useUnicode=true&characterEncoding=utf-8&useSSL=false',
'table-name' = 'symbol',
'username' = 'root',
'password'= '123456'
);
2. Write the SQL
insert into t_symbol_mysql
SELECT T.symbol,T.start_time,T.last_btime,T.end_time,T.start_price,T.last_bprice,T.end_price
FROM t_symbol
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY rowtime
MEASURES
A.rowtime as start_time,
LAST(B.rowtime) as last_btime,
C.rowtime as end_time,
A.price as start_price,
LAST(B.price) as last_bprice,
C.price as end_price
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B+ C) -- A: start of the period, B: the falling rows in the middle, C: the row that ends the period
DEFINE
B AS (LAST(B.price,1) IS NULL AND B.price < A.price) OR B.price < LAST(B.price,1) ,
C AS C.price > LAST(B.price)
) AS T;
ACME,2022-03-26 10:00:00,12,1
ACME,2022-03-26 10:00:01,17,2
ACME,2022-03-26 10:00:02,19,1
ACME,2022-03-26 10:00:03,21,3
ACME,2022-03-26 10:00:04,25,2
ACME,2022-03-26 10:00:05,18,1
ACME,2022-03-26 10:00:06,15,1
ACME,2022-03-26 10:00:07,14,2
ACME,2022-03-26 10:00:08,24,2
ACME,2022-03-26 10:00:09,25,2
ACME,2022-03-26 10:00:15,25,2
ACME,2022-03-26 10:00:10,19,1
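To get a feel for what the query should write to MySQL, the A B+ C logic can be replayed in plain Python. The sketch below is a simplification, not Flink's matcher: it uses only the eleven rows from 10:00:00 to 10:00:10 in timestamp order, leaves out watermarks and late data (so the out-of-order 10:00:15 row is not included), and match_from and falling_periods are hypothetical helper names.

```python
# (time, price) for ACME, taken from the sample rows 10:00:00 to 10:00:10
ROWS = [
    ("10:00:00", 12), ("10:00:01", 17), ("10:00:02", 19), ("10:00:03", 21),
    ("10:00:04", 25), ("10:00:05", 18), ("10:00:06", 15), ("10:00:07", 14),
    ("10:00:08", 24), ("10:00:09", 25), ("10:00:10", 19),
]


def match_from(rows, start):
    """Match A B+ C at index start; return the index of C or None."""
    prev = rows[start][1]  # A has no condition, so any row can be A
    end = start + 1
    # B: the first B must be below A, every later B below the previous B
    while end < len(rows) and rows[end][1] < prev:
        prev = rows[end][1]
        end += 1
    if end == start + 1 or end >= len(rows):
        return None  # no B row at all, or no row left that could be C
    # C: price above the last B
    return end if rows[end][1] > prev else None


def falling_periods(rows):
    """Return (start row, last B row, end row) for every match."""
    out, i = [], 0
    while i < len(rows):
        c = match_from(rows, i)
        if c is None:
            i += 1
        else:
            out.append((rows[i], rows[c - 1], rows[c]))
            i = c + 1  # AFTER MATCH SKIP PAST LAST ROW
    return out


print(falling_periods(ROWS))
# [(('10:00:04', 25), ('10:00:07', 14), ('10:00:08', 24))]
```

Under these assumptions there is one complete falling period: it starts at price 25, bottoms out at 14, and ends when the price rises to 24. The drop from 25 to 19 at the end of the list has no closing C row yet, so it stays an unfinished match.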
Implementation in IDEA
package com.shujia.flink.table
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
object Demo13Symbol {
def main(args: Array[String]): Unit = {
val bsEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
bsEnv.setParallelism(1)
// settings for the table environment
val bsSettings: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner() // use the Blink planner
.inStreamingMode() // streaming mode
.build()
// create the Flink table environment
val bsTableEnv: StreamTableEnvironment = StreamTableEnvironment.create(bsEnv, bsSettings)
bsTableEnv.executeSql(
"""
|CREATE TABLE t_symbol (
| symbol String,
| rowtime TIMESTAMP(3),
| price BIGINT,
| tax BIGINT,
| WATERMARK FOR rowtime AS rowtime - INTERVAL '5' SECOND
|) WITH (
| 'connector' = 'kafka',
| 'topic' = 'symbol',
| 'properties.bootstrap.servers' = 'master:9092,node1:9092,node2:9092',
| 'properties.group.id' = 'asdasdasd',
| 'format' = 'csv',
| 'scan.startup.mode' = 'latest-offset'
|)
|
""".stripMargin)
bsTableEnv.executeSql(
"""
|
|CREATE TABLE t_symbol_mysql (
| symbol String,
| start_time TIMESTAMP(3),
| last_btime TIMESTAMP(3),
| end_time TIMESTAMP(3),
| start_price BIGINT,
| last_bprice BIGINT,
| end_price BIGINT,
| PRIMARY KEY (symbol,start_time) NOT ENFORCED
|) WITH (
| 'connector' = 'jdbc',
| 'url' = 'jdbc:mysql://master:3306/bigdata?useUnicode=true&characterEncoding=utf-8&useSSL=false',
| 'table-name' = 'symbol',
| 'username' = 'root',
| 'password'= '123456'
|)
|
""".stripMargin)
bsTableEnv.executeSql(
"""
|
|
|insert into t_symbol_mysql
|SELECT T.symbol,T.start_time,T.last_btime,T.end_time,T.start_price,T.last_bprice,T.end_price
|FROM t_symbol
| MATCH_RECOGNIZE (
| PARTITION BY symbol
| ORDER BY rowtime
| MEASURES
| A.rowtime as start_time,
| LAST(B.rowtime) as last_btime,
| C.rowtime as end_time,
| A.price as start_price,
| LAST(B.price) as last_bprice,
| C.price as end_price
| AFTER MATCH SKIP PAST LAST ROW
| PATTERN (A B+ C)
| DEFINE
| B AS (LAST(B.price,1) IS NULL AND B.price < A.price) OR B.price < LAST(B.price,1) ,
| C AS C.price > LAST(B.price)
| ) AS T
|
""".stripMargin)
}
}
