Flink学习笔记——Flink MySQL CDC
1.Flink CDC介绍
Flink CDC提供了一系列connector,用于从其他数据源获取变更数据(change data capture),其中的Flink MySQL CDC基于Debezium
官方文档
https://ververica.github.io/flink-cdc-connectors/release-2.3/content/about.html
官方github
https://github.com/ververica/flink-cdc-connectors
Flink和Flink CDC的版本对应关系参考:
https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/flink-sources/overview/
各种数据源使用案例,参考:
基于 AWS S3、EMR Flink、Presto 和 Hudi 的实时数据湖仓 – 使用 EMR 迁移 CDH
2.Flink MySQL CDC
原理
Flink MySQL CDC官方文档:https://github.com/apache/flink-cdc/blob/master/docs/content/docs/connectors/flink-sources/mysql-cdc.md
MySQL binlog可以参考: MySQL学习笔记——binlog
Flink MySQL CDC在2.0版本之后有比较大的性能提升:
1.无锁
2.并发同步(在大表同步的时候可以加速)
3.支持snapshot阶段checkpoint(在snapshot同步阶段失败后可以继续同步)
使用datastream api
package com.bigdata.flink;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.cdc.connectors.mysql.source.MySqlSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
public class MysqlCdcExample {
public static void main(String[] args) throws Exception {
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
.hostname("localhost")
.port(3306)
.databaseList("test") // set captured database, If you need to synchronize the whole database, Please set tableList to ".*".
.tableList("test.user") // set captured table
.username("root")
.password("123456")
.serverTimeZone("America/Danmarkshavn")
.deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
.build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// enable checkpoint
env.enableCheckpointing(3000);
env
.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
// set 1 parallel source tasks
.setParallelism(1)
.print().setParallelism(1); // use parallelism 1 for sink
env.execute("Print MySQL Snapshot + Binlog");
}
}
insert一条MySQL数据,输出如下
{
"before":null,
"after":{
"id":"AQ==",
"username":"test",
"email":"test@test.com"
},
"source":{
"version":"1.9.8.Final",
"connector":"mysql",
"name":"mysql_binlog_source",
"ts_ms":0,
"snapshot":"false",
"db":"test",
"sequence":null,
"table":"user",
"server_id":0,
"gtid":null,
"file":"",
"pos":0,
"row":0,
"thread":null,
"query":null
},
"op":"r",
"ts_ms":1723951109745,
"transaction":null
}
update这条数据,输出如下
{
"before":{
"id":"AQ==",
"username":"test",
"email":"test@test.com"
},
"after":{
"id":"AQ==",
"username":"test123",
"email":"test@test.com"
},
"source":{
"version":"1.9.8.Final",
"connector":"mysql",
"name":"mysql_binlog_source",
"ts_ms":1723951151000,
"snapshot":"false",
"db":"test",
"sequence":null,
"table":"user",
"server_id":1,
"gtid":null,
"file":"mysql-bin.000005",
"pos":1307,
"row":0,
"thread":75,
"query":null
},
"op":"u",
"ts_ms":1723951151768,
"transaction":null
}
删除这条数据,输出如下
{
"before":{
"id":"AQ==",
"username":"test123",
"email":"test@test.com"
},
"after":null,
"source":{
"version":"1.9.8.Final",
"connector":"mysql",
"name":"mysql_binlog_source",
"ts_ms":1723953255000,
"snapshot":"false",
"db":"test",
"sequence":null,
"table":"user",
"server_id":1,
"gtid":null,
"file":"mysql-bin.000005",
"pos":1627,
"row":0,
"thread":99,
"query":null
},
"op":"d",
"ts_ms":1723953255153,
"transaction":null
}
使用flink SQL
报错
1.Caused by: org.apache.flink.table.api.ValidationException: The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai. Specify the right server-time-zone to avoid inconsistencies for time-related fields.
这个因为没有给mysql cdc任务指定时区,可以使用如下命令查看mysql的时区
mysql> show variables like '%time_zone%'; +------------------+--------+ | Variable_name | Value | +------------------+--------+ | system_time_zone | UTC | | time_zone | SYSTEM | +------------------+--------+ 2 rows in set (0.02 sec) mysql> select now(); +---------------------+ | now() | +---------------------+ | 2024-08-17 15:29:48 | +---------------------+ 1 row in set (0.00 sec)
假设mysql的时区是UTC,则需要指定和mysql时区一样的时区配置,如下
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
.hostname("localhost")
.port(55000)
.databaseList("default") // set captured database, If you need to synchronize the whole database, Please set tableList to ".*".
.tableList("default.test") // set captured table
.username("root")
.password("123456")
.serverTimeZone("America/Danmarkshavn")
.deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
.build();
2.Caused by: org.apache.flink.util.FlinkRuntimeException: Cannot read the binlog filename and position via 'SHOW MASTER STATUS'. Make sure your server is correctly configured
需要假设mysql的binlog是否是开启ON状态,如果是OFF状态的话则会报错
mysql> show variables like 'log_bin';
MySQL CDC业界使用案例
1.基于 Flink CDC + Hudi 湖仓一体方案实践 (37互娱)
2.Flink CDC + Hudi 海量数据入湖在顺丰的实践 (顺丰)
本文只发表于博客园和tonglin0325的博客,作者:tonglin0325,转载请注明原文链接:https://www.cnblogs.com/tonglin0325/p/5321328.html

浙公网安备 33010602011771号