Flink Code Best Practices
Overall program flow
- Obtain an execution environment
- Load/create the initial data
- Specify transformations on the data
- Specify where to put the results of the computation
- Trigger program execution (a minimal end-to-end sketch follows this list)
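The five steps end to end, as a minimal sketch; the bounded file source, path, and job name here are placeholders, not recommendations:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SkeletonJob {
    public static void main(String[] args) throws Exception {
        // 1. obtain an execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. load/create the initial data (a bounded text-file source; the path is hypothetical)
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("/tmp/input.txt"))
                .build();
        DataStream<String> lines = env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

        // 3. specify transformations on the data
        DataStream<String> upper = lines.map(String::toUpperCase);

        // 4. specify where to put the results (print sink for debugging)
        upper.print();

        // 5. trigger the program execution
        env.execute("skeleton-job");
    }
}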
Example pom:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.alibaba.ververica</groupId>
    <artifactId>ververica-connector-hologres-demo</artifactId>
    <version>1.17-vvr-8.0.11-1</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.17.2</flink.version>
        <vvr.version>1.17-vvr-8.0.11-1</vvr.version>
        <target.java.version>1.8</target.java.version>
        <maven.compiler.source>${target.java.version}</maven.compiler.source>
        <maven.compiler.target>${target.java.version}</maven.compiler.target>
        <log4j.version>2.14.1</log4j.version>
    </properties>

    <dependencies>
        <!-- Apache Flink dependencies -->
        <!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-runtime</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba.ververica</groupId>
            <artifactId>ververica-connector-hologres</artifactId>
            <version>${vvr.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Java Compiler -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${target.java.version}</source>
                    <target>${target.java.version}</target>
                </configuration>
            </plugin>

            <!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. -->
            <!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <!-- Run shade goal on package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>org.apache.logging.log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                         Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Flink configuration files: https://www.cnblogs.com/jiangbei/p/19506138
Obtaining an execution environment
The environment is usually obtained the same fixed way; as recommended, declare the variable final:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
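Right after obtaining the environment you can set job-wide options. A small sketch; the values are illustrative, not recommendations:

// optional environment setup
env.setParallelism(4);           // default parallelism for all operators
env.enableCheckpointing(60_000); // checkpoint every 60 s; needed by the exactly-once sinks shown later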
Loading data sources
Since Flink 1.15 the new fromSource API is recommended:
Kafka source:
Reading plain strings:
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka-broker:9092")
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

DataStream<String> data = env.fromSource(
        source,
        WatermarkStrategy.noWatermarks(), // specify the watermark strategy
        "Kafka Source"
);
To convert records into a POJO at load time, a custom deserializer is needed:
Sample data:
{ "id": 231762, "config_type": "JAVA服务-AutoMq", "indicator_name": "kafka_log_end_offset_cold", "instance_id": "kf-gan8h3m70kry7cgu", "job_id": "4", "request_time": "2026-02-02 11:38:34", "response_data": "{\"message\":{\"traceId\":\"69801c3a00000000c0a221254446fb0c\",\"code\":200,\"success\":true,\"errorCode\":\"\",\"message\":\"\",\"content\":{\"data\":[{\"index_store_type\":\"\",\"query_progress\":{\"scanned_compressed_bytes\":0,\"nanos_to_finish\":0,\"total_rows\":0,\"scanned_uncompressed_bytes\":0,\"total_compressed_bytes\":0,\"total_percentage\":0,\"nanos_from_submitted\":0,\"total_uncompressed_bytes\":0,\"scanned_rows\":0,\"nanos_from_started\":0},\"cost\":\"\",\"next_cursor_time\":0,\"query_status\":\"\",\"index_names\":\"\",\"scan_completed\":false,\"async_id\":\"\",\"query_type\":\"\",\"sample\":0,\"is_cross_ws\":true,\"is_running\":false,\"is_data_latency\":false,\"next_cursor_token\":\"\",\"datasource\":\"\",\"series\":[{\"columns\":[\"time\",\"max(sum(kafka_log_end_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\"}) - sum(kafka_group_commit_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\",consumer_group=~\\\"flink-paimon-prod-0\\\"}), 0)\"],\"values\":[[1770003454156,71852]],\"column_names\":[\"time\",\"max(sum(kafka_log_end_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\"}) - sum(kafka_group_commit_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\",consumer_group=~\\\"flink-paimon-prod-0\\\"}), 0)\"]}],\"interval\":0,\"column_names\":[\"max(sum(kafka_log_end_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\"}) - sum(kafka_group_commit_offset{instance_id=~\\\"kf-gan8h3m70kry7cgu\\\",topic=~\\\"geea3_rawdata_prod\\\",consumer_group=~\\\"flink-paimon-prod-0\\\"}), 0)\"],\"scan_index\":\"\",\"window\":0,\"complete\":false,\"index_name\":\"\"}],\"declaration\":{\"business\":\"\",\"organization\":\"default_private_organization\"}}}}", "dt": "20260202" }
The raw POJO:
import java.io.Serializable;
/**
 * Kafka message POJO.
 * Field names match the keys in the JSON (snake_case).
 */
public class KafkaMessage implements Serializable {
    private static final long serialVersionUID = 1L;

    private Long id;
    private String config_type;
    private String indicator_name;
    private String instance_id;
    private String job_id;
    private String request_time;
    private String response_data; // kept as a JSON string for downstream processing
    private String dt;

    // a no-arg constructor is required
    public KafkaMessage() {}

    // compact getters/setters (consider Lombok in production)
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getConfig_type() { return config_type; }
    public void setConfig_type(String config_type) { this.config_type = config_type; }
    public String getIndicator_name() { return indicator_name; }
    public void setIndicator_name(String indicator_name) { this.indicator_name = indicator_name; }
    public String getInstance_id() { return instance_id; }
    public void setInstance_id(String instance_id) { this.instance_id = instance_id; }
    public String getJob_id() { return job_id; }
    public void setJob_id(String job_id) { this.job_id = job_id; }
    public String getRequest_time() { return request_time; }
    public void setRequest_time(String request_time) { this.request_time = request_time; }
    public String getResponse_data() { return response_data; }
    public void setResponse_data(String response_data) { this.response_data = response_data; }
    public String getDt() { return dt; }
    public void setDt(String dt) { this.dt = dt; }

    @Override
    public String toString() {
        return String.format("KafkaMessage{id=%d, instance='%s', indicator='%s', dt='%s'}",
                id, instance_id, indicator_name, dt);
    }
}
The deserializer class:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.DeserializationFeature;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
/**
 * Kafka message deserializer (handles the value only).
 * Uses Flink's shaded Jackson, so no extra dependency is needed.
 */
public class KafkaMessageDeserializer implements DeserializationSchema<KafkaMessage> {
    private static final long serialVersionUID = 1L;

    // singleton ObjectMapper via the static holder pattern
    private static final class ObjectMapperHolder {
        static final ObjectMapper INSTANCE = createObjectMapper();

        private static ObjectMapper createObjectMapper() {
            ObjectMapper mapper = new ObjectMapper();
            // ignore unknown JSON fields (so schema changes don't break parsing)
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            // allow empty strings to be read as null objects
            mapper.configure(DeserializationFeature.ACCEPT_EMPTY_STRING_AS_NULL_OBJECT, true);
            return mapper;
        }
    }

    @Override
    public KafkaMessage deserialize(byte[] message) throws IOException {
        try {
            // deserialize directly with the singleton ObjectMapper
            return ObjectMapperHolder.INSTANCE.readValue(message, KafkaMessage.class);
        } catch (Exception e) {
            // return a marker record instead of failing the whole job
            KafkaMessage errorMessage = new KafkaMessage();
            errorMessage.setId(-1L);
            errorMessage.setIndicator_name("parse_error");
            // keep the first 100 characters of the raw message for troubleshooting
            String rawMsg = new String(message, "UTF-8");
            errorMessage.setResponse_data(rawMsg.length() > 100 ?
                    rawMsg.substring(0, 100) + "..." : rawMsg);
            return errorMessage;
        }
    }

    @Override
    public boolean isEndOfStream(KafkaMessage nextElement) {
        return false; // a Kafka stream is unbounded
    }

    @Override
    public TypeInformation<KafkaMessage> getProducedType() {
        return TypeInformation.of(KafkaMessage.class);
    }

    // optional: open() can be used to initialize metrics and the like
    @Override
    public void open(InitializationContext context) {
        // e.g. context.getMetricGroup().counter("deserialize_count");
    }
}
Creating the source:
KafkaSource<KafkaMessage> source = KafkaSource.<KafkaMessage>builder()
        .setBootstrapServers("localhost:9092")  // replace with real brokers in production
        .setTopics("monitoring-data-topic")     // your monitoring-data topic
        .setGroupId("flink-monitoring-group")
        // value-only deserializer (handles only the record value)
        .setValueOnlyDeserializer(new KafkaMessageDeserializer())
        // start consuming from the latest offsets
        .setStartingOffsets(OffsetsInitializer.latest())
        // important: commit offsets together with checkpoints
        .setProperty("commit.offsets.on.checkpoint", "true")
        // discover new partitions dynamically (every minute)
        .setProperty("partition.discovery.interval.ms", "60000")
        // consumer tuning
        .setProperty("fetch.max.wait.ms", "500") // max fetch wait time
        .setProperty("fetch.min.bytes", "1")     // min bytes per fetch
        .setProperty("max.poll.records", "500")  // max records per poll
        .build();
File-stream source:
For simple data structures you can use the built-in Tuple2:
package org.example;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2; // use Flink's Tuple2, not scala.Tuple2
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterTest {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // create a FileSource (static mode)
        FileSource<String> fileSource = FileSource
                .forRecordStreamFormat(
                        new TextLineInputFormat(), // read line by line
                        new Path("D:/tmp/t.txt")   // file path
                )
                .processStaticFileSet()            // process a static set of files
                .build();

        DataStream<String> lines = env.fromSource(
                fileSource,
                WatermarkStrategy.noWatermarks(),
                "windows-file-reader"
        );

        lines.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                String[] split = value.split("=");
                return new Tuple2<>(split[0], split[1]);
            }
        }).print("file-test");

        env.execute();
    }
}
Specifying transformations on the data
Operators:
map: one record in, one record out; every element leaves the operator transformed into a new value
DataStream<String> lines = env.fromSource(
        fileSource,
        WatermarkStrategy.noWatermarks(),
        "windows-file-reader"
);
lines.map(new MyMapFunction()).print("file-test");
Operators are usually defined in a class of their own; when using a lambda or method reference instead, pay special attention to type erasure (see the sketch after this class):
package service;

import bean.WaterSenor;
import org.apache.flink.api.common.functions.MapFunction;

public class MyMapFunction implements MapFunction<WaterSenor, String> {
    @Override
    public String map(WaterSenor value) throws Exception {
        return value.getId() + ":" + value.getVc();
    }
}
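When the same logic is written as a lambda with a generic output type (for example Tuple2), that type is erased at compile time and Flink cannot infer it. A minimal sketch of the usual fix, reusing the lines stream from the file-source example above; returns() supplies the type hint explicitly:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;

// the lambda's Tuple2<String, Integer> output type is erased at compile time,
// so hand Flink the type information explicitly via returns()
lines.map(line -> Tuple2.of(line, line.length()))
     .returns(Types.TUPLE(Types.STRING, Types.INT))
     .print();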
filter: the element is kept when the predicate returns true, otherwise it is dropped
Likewise, implement the interface; the generic parameter is the type of the data being filtered:
package service;

import bean.WaterSenor;
import org.apache.flink.api.common.functions.FilterFunction;

public class MyFilterFunction implements FilterFunction<WaterSenor> {
    @Override
    public boolean filter(WaterSenor value) throws Exception {
        return "sensor_1".equalsIgnoreCase(value.getId());
    }
}
flatMap: flat mapping, one record in and zero or more out; for example in word count, one line yields multiple words
Again, implement a class; the generic parameters are the input and output types:
package service;

import bean.WaterSenor;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

public class MyFlatMapFunction implements FlatMapFunction<WaterSenor, String> {
    @Override
    public void flatMap(WaterSenor value, Collector<String> out) throws Exception {
        if ("sensor_1".equalsIgnoreCase(value.getId())) {
            out.collect(value.getTs().toString());
        } else if ("sensor_2".equalsIgnoreCase(value.getId())) {
            out.collect(value.getTs().toString());
            out.collect(value.getVc().toString());
        }
    }
}
The keyBy operator:
keyBy() partitions an unbounded stream (DataStream) by key, so that records with the same key are routed to the same parallel task for processing.
Flink supports several ways of specifying the key; lambda expressions and method references (Java 8 style) are recommended.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByExample1 {

    // minimal event type with public fields, matching the usage below
    public static class UserEvent {
        public String userId;
        public String action;
        public long timestamp;

        public UserEvent() {}

        public UserEvent(String userId, String action, long timestamp) {
            this.userId = userId;
            this.action = action;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // simulated stream
        env.fromElements(
                new UserEvent("u1", "click", 1000),
                new UserEvent("u2", "buy", 1005),
                new UserEvent("u1", "view", 1010),
                new UserEvent("u2", "click", 1015),
                new UserEvent("u3", "buy", 1020)
        )
        .keyBy(userEvent -> userEvent.userId) // ✅ Java 8 lambda: group by userId
        .print(); // records with the same key end up in the same task

        env.execute("KeyBy by Field (Lambda)");
    }
}
keyBy is usually combined with windows:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class OrderAmountByRegion {

    // minimal event type with public fields, matching the usage below
    public static class OrderEvent {
        public String region;
        public double amount;
        public long eventTime;

        public OrderEvent() {}

        public OrderEvent(String region, double amount, long eventTime) {
            this.region = region;
            this.amount = amount;
            this.eventTime = eventTime;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<OrderEvent> orders = env.fromElements(
                new OrderEvent("beijing", 150.0, 1000000), // timestamps in epoch millis
                new OrderEvent("shanghai", 200.0, 1005000),
                new OrderEvent("beijing", 300.0, 1010000),
                new OrderEvent("shanghai", 180.0, 1020000),
                new OrderEvent("beijing", 250.0, 1030000)
        )
        // event time with 5 s bounded out-of-orderness; this replaces the deprecated
        // TimeCharacteristic / BoundedOutOfOrdernessTimestampExtractor APIs
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((element, recordTs) -> element.eventTime)
        );

        orders
                .keyBy(order -> order.region) // string-based keyBy was removed; use a lambda
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                .sum("amount")
                .print("💰 hourly order total:");

        env.execute("Order Amount by Region (Event Time)");
    }
}
The process operator:
ProcessFunction is Flink's lowest-level stream-processing API and offers fine-grained control over the stream. Unlike the regular map, filter, and flatMap operators, a ProcessFunction can:
- access time attributes (event time, processing time)
- manage state (keyed state, operator state)
- register timers (based on event time or processing time)
- handle late data
- use side outputs
The core method to implement:
/**
 * Processes one element.
 * value: the input element currently being processed; its type matches the stream's element type.
 * ctx:   a Context (an inner abstract class of ProcessFunction) representing the current execution
 *        context; it exposes the element's timestamp, a TimerService for querying time and
 *        registering timers, and output() for sending records to side outputs.
 * out:   the Collector for emitting results; used exactly like the collector in flatMap,
 *        call out.collect() to send a record downstream.
 */
@Override
public void processElement(I value, Context ctx, Collector<O> out) throws Exception {}
Usage example:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Complete ProcessFunction example.
 * Demonstrates state management, timers, and side outputs.
 */
public class ProcessFunctionExample {

    // side-output tag for invalid data
    private static final OutputTag<String> ERROR_TAG = new OutputTag<String>("errors") {};

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // simulated sensor stream
        DataStream<SensorReading> sensorData = env.fromElements(
                new SensorReading("sensor1", 25.3, System.currentTimeMillis()),
                new SensorReading("sensor1", 28.1, System.currentTimeMillis() + 1000),
                new SensorReading("sensor1", 32.5, System.currentTimeMillis() + 2000), // high-temp alert
                new SensorReading("sensor2", 18.7, System.currentTimeMillis()),
                new SensorReading("sensor2", 45.9, System.currentTimeMillis() + 1000), // high-temp alert
                new SensorReading("sensor1", 22.8, System.currentTimeMillis() + 3000),
                new SensorReading("sensor3", -10.0, System.currentTimeMillis()),       // invalid data
                new SensorReading("sensor1", 29.5, System.currentTimeMillis() + 4000)
        );

        // process the data with a ProcessFunction
        SingleOutputStreamOperator<String> processedStream = sensorData
                .keyBy(SensorReading::getSensorId)
                .process(new SensorProcessFunction());

        // main output
        processedStream.print("Main Output");

        // side output (invalid data)
        processedStream.getSideOutput(ERROR_TAG).print("Error Output");

        env.execute("Process Function Example");
    }

    /**
     * Custom ProcessFunction implementation.
     */
    public static class SensorProcessFunction extends ProcessFunction<SensorReading, String> {

        // state: last temperature per sensor
        private transient ValueState<Double> lastTempState;
        // state: timestamp of the registered timer
        private transient ValueState<Long> timerState;

        @Override
        public void open(Configuration parameters) throws Exception {
            // initialize state
            ValueStateDescriptor<Double> lastTempDesc =
                    new ValueStateDescriptor<>("lastTemp", Types.DOUBLE);
            lastTempState = getRuntimeContext().getState(lastTempDesc);

            ValueStateDescriptor<Long> timerDesc =
                    new ValueStateDescriptor<>("timerState", Types.LONG);
            timerState = getRuntimeContext().getState(timerDesc);
        }

        @Override
        public void processElement(SensorReading reading, Context ctx, Collector<String> out)
                throws Exception {
            // 1. validation: route invalid data to the side output
            if (reading.getTemperature() < -50 || reading.getTemperature() > 100) {
                ctx.output(ERROR_TAG, "Invalid temperature reading: " + reading);
                return;
            }

            // 2. state: fetch the previous temperature
            Double lastTemp = lastTempState.value();
            if (lastTemp == null) {
                lastTemp = 0.0;
            }

            // 3. business logic: temperature-change detection
            double tempChange = Math.abs(reading.getTemperature() - lastTemp);
            if (tempChange > 5.0) {
                out.collect("ALERT: Temperature changed significantly - " + reading.getSensorId()
                        + ": " + lastTemp + " -> " + reading.getTemperature());
            } else {
                out.collect("Normal reading: " + reading.getSensorId() + " - " + reading.getTemperature());
            }

            // 4. update state
            lastTempState.update(reading.getTemperature());

            // 5. timer management: register a timer 10 seconds from now (processing time)
            long currentTime = ctx.timerService().currentProcessingTime();
            long timerTime = currentTime + 10000; // 10 seconds later

            // cancel the previous timer, if any
            Long currentTimer = timerState.value();
            if (currentTimer != null) {
                ctx.timerService().deleteProcessingTimeTimer(currentTimer);
            }

            // register the new timer
            ctx.timerService().registerProcessingTimeTimer(timerTime);
            timerState.update(timerTime);

            // 6. accessing the timestamp (when using event time)
            // Long eventTimestamp = ctx.timestamp();
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
                throws Exception {
            // runs when the timer fires
            Double lastTemp = lastTempState.value();
            out.collect("TIMER: No data received for sensor in 10 seconds. Last temp: " + lastTemp);
            // clear the timer state
            timerState.clear();
        }
    }

    /**
     * Sensor reading data class.
     */
    public static class SensorReading {
        private String sensorId;
        private double temperature;
        private long timestamp;

        public SensorReading() {}

        public SensorReading(String sensorId, double temperature, long timestamp) {
            this.sensorId = sensorId;
            this.temperature = temperature;
            this.timestamp = timestamp;
        }

        public String getSensorId() { return sensorId; }
        public double getTemperature() { return temperature; }
        public long getTimestamp() { return timestamp; }

        @Override
        public String toString() {
            return sensorId + ": " + temperature + "°C @" + timestamp;
        }
    }
}
Rich functions
Rich functions are a special family of function interfaces in Flink: on top of the plain functions they add lifecycle management and access to the runtime context. Every function whose name starts with Rich extends AbstractRichFunction.
When to use them
- State management: maintaining state across multiple records
- Resource initialization: opening database connections, network resources, and the like
- Configuration: reading parameters from configuration files
- Performance monitoring: collecting statistics at the operator level
- Resource cleanup: making sure resources are released correctly
Function structure:
public class MyRichMapper extends RichMapFunction<IN, OUT> {
    @Override
    public void open(Configuration parameters) {
        // initialization
    }

    @Override
    public OUT map(IN value) {
        // transformation logic
    }

    @Override
    public void close() {
        // cleanup
    }
}
Example:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.sql.*;
import java.util.logging.Logger;

public class DatabaseEnrichmentMapper extends RichMapFunction<SensorReading, EnhancedSensorReading> {

    private static final Logger logger = Logger.getLogger(DatabaseEnrichmentMapper.class.getName());

    // database connection (transient to keep it out of serialization)
    private transient Connection connection;

    // database config (better read from a config file or environment variables)
    private final String dbUrl = "jdbc:mysql://localhost:3306/sensor_db?useSSL=false&serverTimezone=UTC";
    private final String username = "root";
    private final String password = "your_password"; // use a secret manager in production

    // prepared statement (better performance)
    private transient PreparedStatement selectSensorStmt;

    @Override
    public void open(Configuration parameters) throws Exception {
        logger.info("=== DatabaseEnrichmentMapper: initializing database connection ===");
        try {
            // 1. load the JDBC driver (optional on Java 8+, but explicit loading is safer)
            Class.forName("com.mysql.cj.jdbc.Driver");

            // 2. open the connection
            connection = DriverManager.getConnection(dbUrl, username, password);

            // 3. precompile the query (avoids building SQL strings per record)
            String sql = "SELECT location, model, factory FROM sensor_metadata WHERE sensor_id = ?";
            selectSensorStmt = connection.prepareStatement(sql);

            logger.info("✅ database connected, prepared statement initialized");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException("MySQL JDBC driver not found, check dependencies", e);
        } catch (SQLException e) {
            throw new RuntimeException("Database connection failed: " + e.getMessage(), e);
        }
    }

    @Override
    public EnhancedSensorReading map(SensorReading reading) throws Exception {
        // 4. look up sensor metadata with the prepared statement
        selectSensorStmt.setString(1, reading.sensorId);
        ResultSet rs = selectSensorStmt.executeQuery();

        String location = "unknown location";
        String model = "unknown model";
        String factory = "unknown factory";
        if (rs.next()) {
            location = rs.getString("location");
            model = rs.getString("model");
            factory = rs.getString("factory");
        } else {
            logger.warning("⚠️ no metadata found for sensor " + reading.sensorId);
        }
        rs.close();

        // 5. return the enriched record
        return new EnhancedSensorReading(reading, location, model, factory);
    }

    @Override
    public void close() throws Exception {
        logger.info("=== DatabaseEnrichmentMapper: closing resources ===");
        // 6. close resources gracefully (important to avoid connection leaks)
        if (selectSensorStmt != null) {
            selectSensorStmt.close();
        }
        if (connection != null) {
            connection.close();
        }
        logger.info("✅ database connection closed");
    }
}
Side outputs
Side outputs are a very powerful Flink feature: they let a single processing function emit additional output streams beyond the main one, and those extra streams may have data types different from the main output.
Core idea: while processing one input record, besides emitting zero, one, or more records to the main output, you can also send any number of records to any number of named side outputs.
Defining side-output tags:
import org.apache.flink.util.OutputTag;

// 1. a side-output tag for odd numbers; tag id `odd-numbers`, element type Integer
private static final OutputTag<Integer> ODD_OUTPUT_TAG = new OutputTag<Integer>("odd-numbers"){};

// 2. another tag for malformed records; element type String
private static final OutputTag<String> ERROR_OUTPUT_TAG = new OutputTag<String>("parse-errors"){};
Emitting through the low-level process operator:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class SideOutputExample {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // simulated input: numbers plus one malformed string
        DataStream<String> source = env.fromElements("1", "2", "3", "4", "five", "6");

        // use a ProcessFunction
        SingleOutputStreamOperator<Integer> mainStream =
                source.process(new ProcessFunction<String, Integer>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<Integer> out)
                            throws Exception {
                        try {
                            // 1. try to parse the string as an integer
                            int intValue = Integer.parseInt(value);
                            // 2. business logic: even -> main stream; odd -> side output (ODD_OUTPUT_TAG)
                            if (intValue % 2 == 0) {
                                out.collect(intValue);                // emit to the main stream
                            } else {
                                ctx.output(ODD_OUTPUT_TAG, intValue); // emit to the odd side stream
                            }
                        } catch (NumberFormatException e) {
                            // 3. parse failure -> error side output (ERROR_OUTPUT_TAG)
                            ctx.output(ERROR_OUTPUT_TAG, "unparseable string: \"" + value + "\"");
                        }
                    }
                });

        // fetch the side outputs from the main stream
        DataStream<Integer> oddSideStream = mainStream.getSideOutput(ODD_OUTPUT_TAG);
        DataStream<String> errorSideStream = mainStream.getSideOutput(ERROR_OUTPUT_TAG);

        // print
        mainStream.print("main (even):");
        oddSideStream.print("side (odd):");
        errorSideStream.print("side (errors):");

        env.execute("Flink Side Output Example");
    }
}
Complete main-program example:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Complete Flink side-output example.
 * Processes an order stream, splitting invalid data and large orders into side outputs.
 */
public class SideOutputExample {

    // side-output tags
    // 1. invalid data (String, carries the error message)
    private static final OutputTag<String> ERROR_TAG = new OutputTag<String>("errors") {};
    // 2. large orders (keeps the original order type)
    private static final OutputTag<OrderEvent> BIG_AMOUNT_TAG = new OutputTag<OrderEvent>("big-amount-orders") {};

    public static void main(String[] args) throws Exception {
        // 1. create the stream environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. simulated source (in practice this could come from Kafka, a socket, etc.)
        DataStream<String> sourceStream = env.fromElements(
                "order1,100.0,2023-01-01 10:00:00",  // normal order
                "order2,1500.0,2023-01-01 10:01:00", // large order
                "order3,-50.0,2023-01-01 10:02:00",  // invalid amount
                "order4,200.0",                      // malformed (missing time field)
                "order5,3000.0,2023-01-01 10:04:00"  // large order
        );

        // 3. parse strings into OrderEvent objects
        DataStream<OrderEvent> orderStream = sourceStream
                .map(new MapFunction<String, OrderEvent>() {
                    @Override
                    public OrderEvent map(String value) throws Exception {
                        String[] parts = value.split(",");
                        if (parts.length < 3) {
                            // malformed record: keep it moving with an empty timestamp so the
                            // ProcessFunction below can route it to the error side output
                            // (throwing here would fail the whole job)
                            return new OrderEvent(parts[0].trim(), Double.parseDouble(parts[1].trim()), "");
                        }
                        return new OrderEvent(
                                parts[0].trim(),
                                Double.parseDouble(parts[1].trim()),
                                parts[2].trim()
                        );
                    }
                });

        // 4. process the data, routing to side outputs as needed
        SingleOutputStreamOperator<OrderEvent> mainStream = orderStream
                .process(new ProcessFunction<OrderEvent, OrderEvent>() {
                    @Override
                    public void processElement(OrderEvent order, Context ctx, Collector<OrderEvent> out)
                            throws Exception {
                        try {
                            // validation logic
                            if (order.getAmount() < 0) {
                                throw new IllegalArgumentException("Amount cannot be negative: " + order.getAmount());
                            }
                            if (order.getTimestamp() == null || order.getTimestamp().isEmpty()) {
                                throw new IllegalArgumentException("Timestamp is required");
                            }

                            // main output: every order that passes validation
                            out.collect(order);

                            // side output: orders over 1000
                            if (order.getAmount() > 1000.0) {
                                ctx.output(BIG_AMOUNT_TAG, order);
                            }
                        } catch (Exception e) {
                            // side output: error message
                            String errorMsg = String.format("Error processing order %s: %s",
                                    order.getOrderId(), e.getMessage());
                            ctx.output(ERROR_TAG, errorMsg);
                        }
                    }
                });

        // 5. fetch the side outputs from the main stream
        DataStream<String> errorStream = mainStream.getSideOutput(ERROR_TAG);
        DataStream<OrderEvent> bigAmountStream = mainStream.getSideOutput(BIG_AMOUNT_TAG);

        // 6. downstream handling of each stream
        // main stream: normal orders
        mainStream.map(order -> "Normal order: " + order.getOrderId() + " - Amount: " + order.getAmount())
                .print("Main Output");

        // error stream: invalid data
        errorStream.map(error -> "ERROR: " + error)
                .print("Error Output");

        // large-order stream: monitor big transactions
        bigAmountStream.map(order -> "ALERT: Big amount order detected - " + order.getOrderId()
                        + " - Amount: " + order.getAmount())
                .print("Big Amount Output");

        // 7. run the job
        env.execute("Flink Side Output Example");
    }

    /**
     * Order event data class.
     */
    public static class OrderEvent {
        private String orderId;
        private double amount;
        private String timestamp;

        public OrderEvent() {}

        public OrderEvent(String orderId, double amount, String timestamp) {
            this.orderId = orderId;
            this.amount = amount;
            this.timestamp = timestamp;
        }

        // getters and setters
        public String getOrderId() { return orderId; }
        public void setOrderId(String orderId) { this.orderId = orderId; }
        public double getAmount() { return amount; }
        public void setAmount(double amount) { this.amount = amount; }
        public String getTimestamp() { return timestamp; }
        public void setTimestamp(String timestamp) { this.timestamp = timestamp; }

        @Override
        public String toString() {
            return "OrderEvent{" +
                    "orderId='" + orderId + '\'' +
                    ", amount=" + amount +
                    ", timestamp='" + timestamp + '\'' +
                    '}';
        }
    }
}
State management
Flink state
To be added; a minimal sketch of keyed state follows in the meantime.
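A minimal sketch of keyed state: a ValueState counter inside a RichFlatMapFunction. The class name, state name, and usage line are illustrative:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// emits, per key, how many records have been seen so far
public class CountPerKey extends RichFlatMapFunction<String, String> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        // keyed state: one Long counter per key, managed and checkpointed by Flink
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        Long count = countState.value(); // scoped to the current key
        count = (count == null) ? 1L : count + 1;
        countState.update(count);
        out.collect(value + " seen " + count + " times");
    }
}

// usage: keyed state requires a keyBy first
// stream.keyBy(v -> v).flatMap(new CountPerKey()).print();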
Specifying where to put the results
Flink 1.15 introduced the upgraded Sink V2 API.
Kafka sink:
// recommended: put all producer configuration in one Properties object
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "localhost:9092");
kafkaProps.setProperty("acks", "all");
kafkaProps.setProperty("batch.size", "16384");
kafkaProps.setProperty("linger.ms", "5");
kafkaProps.setProperty("max.request.size", "1048576");
kafkaProps.setProperty("compression.type", "snappy");

// exactly-once related configuration
kafkaProps.setProperty("transaction.timeout.ms", "900000"); // must exceed the checkpoint interval
// kafkaProps.setProperty("transactional.id", "do not set manually; Flink derives it from a prefix");

KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
        .setRecordSerializer(
                KafkaRecordSerializationSchema.builder()
                        .setTopic("target-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build()
        )
        .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE) // or AT_LEAST_ONCE; without this call the default is NONE
        .setTransactionalIdPrefix("my-sink")                 // recommended with EXACTLY_ONCE
        .setKafkaProducerConfig(kafkaProps)                  // all producer config goes here
        .build();
A minimal version:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;

public class MinimalKafkaSinkExample {
    public static void main(String[] args) {
        // assume an existing DataStream<String>
        DataStream<String> stream = ...;

        // the most minimal Kafka sink configuration
        KafkaSink<String> minimalSink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092") // required: Kafka cluster address
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.builder()
                                .setTopic("my-topic") // required: target topic
                                .setValueSerializationSchema(new SimpleStringSchema()) // required: value serializer
                                .build()
                )
                .build(); // note: with no setDeliverGuarantee the default is DeliveryGuarantee.NONE

        stream.sinkTo(minimalSink);
    }
}
JDBC sink:
At-least-once:
import org.apache.flink.connector.jdbc.*;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AtLeastOnceMySQLSink {

    // POJO
    static class Order {
        public Long orderId;
        public String userId;
        public Double amount;
        public Long timestamp;

        public Order() {}

        public Order(Long orderId, String userId, Double amount, Long timestamp) {
            this.orderId = orderId;
            this.userId = userId;
            this.amount = amount;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // enable checkpointing (for state recovery; it alone does not give exactly-once here)
        env.enableCheckpointing(60000); // 60 s

        // 2. sample source
        DataStream<Order> orderStream = env.fromElements(
                new Order(1001L, "user1", 99.9, System.currentTimeMillis()),
                new Order(1002L, "user2", 199.9, System.currentTimeMillis())
        );

        // 3. idempotent SQL: ON DUPLICATE KEY UPDATE or REPLACE
        // option 1: ON DUPLICATE KEY UPDATE (recommended)
        String sql = "INSERT INTO orders (order_id, user_id, amount, order_time) " +
                "VALUES (?, ?, ?, ?) " +
                "ON DUPLICATE KEY UPDATE " +
                "amount = VALUES(amount), order_time = VALUES(order_time)";

        // option 2: REPLACE INTO (deletes the old row, then inserts)
        // String sql = "REPLACE INTO orders (order_id, user_id, amount, order_time) " +
        //         "VALUES (?, ?, ?, ?)";

        // 4. statement builder
        JdbcStatementBuilder<Order> statementBuilder = new JdbcStatementBuilder<Order>() {
            @Override
            public void accept(PreparedStatement ps, Order order) throws SQLException {
                ps.setLong(1, order.orderId);
                ps.setString(2, order.userId);
                ps.setDouble(3, order.amount);
                ps.setTimestamp(4, new java.sql.Timestamp(order.timestamp));
            }
        };

        // 5. execution options: batched writes
        JdbcExecutionOptions executionOptions = JdbcExecutionOptions.builder()
                .withBatchSize(1000)      // batch size
                .withBatchIntervalMs(200) // batch interval (ms)
                .withMaxRetries(3)        // max retries
                .build();

        // 6. connection options
        JdbcConnectionOptions connectionOptions =
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:mysql://localhost:3306/flink_test")
                        .withDriverName("com.mysql.cj.jdbc.Driver")
                        .withUsername("root")
                        .withPassword("password")
                        .withConnectionCheckTimeoutSeconds(60) // optional
                        .build();

        // 7. create the JDBC sink (JdbcSink.sink returns a SinkFunction, so attach it with addSink)
        SinkFunction<Order> jdbcSink = JdbcSink.sink(
                sql,
                statementBuilder,
                executionOptions,
                connectionOptions
        );

        // 8. attach the sink
        orderStream.addSink(jdbcSink);

        // 9. run
        env.execute("At-Least-Once MySQL Sink Example");
    }
}
Controlling duplicate writes at the database level:
CREATE TABLE orders (
    order_id   BIGINT PRIMARY KEY,  -- a primary key or unique index is required
    user_id    VARCHAR(50),
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
Without such an idempotent schema, exactly-once needs two-phase commit; the JDBC connector ships an XA-based JdbcSink.exactlyOnceSink for that, but idempotent upserts as shown above are usually the simpler choice (a sketch follows).
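For reference, a sketch of the XA-based variant, closely following the flink-connector-jdbc documentation. It reuses the orders table and statementBuilder from the at-least-once example above; connection values are placeholders, and the options should be verified against your connector version:

import org.apache.flink.connector.jdbc.JdbcExactlyOnceOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import com.mysql.cj.jdbc.MysqlXADataSource;

// XA-based exactly-once JDBC sink; checkpointing must be enabled
orderStream.addSink(JdbcSink.exactlyOnceSink(
        "INSERT INTO orders (order_id, user_id, amount, order_time) VALUES (?, ?, ?, ?)",
        statementBuilder, // same JdbcStatementBuilder as above
        JdbcExecutionOptions.builder()
                .withMaxRetries(0) // the XA sink requires maxRetries = 0 to avoid duplicates
                .build(),
        JdbcExactlyOnceOptions.builder()
                .withTransactionPerConnection(true) // needed for MySQL: one XA transaction per connection
                .build(),
        () -> {
            // supplier for an XADataSource
            MysqlXADataSource xaDataSource = new MysqlXADataSource();
            xaDataSource.setUrl("jdbc:mysql://localhost:3306/flink_test");
            xaDataSource.setUser("root");
            xaDataSource.setPassword("password");
            return xaDataSource;
        }
));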
The simplest option, printing for debugging:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.PrintSink;

public class PrintSinkExample {
    public static void main(String[] args) throws Exception {
        DataStream<String> stream = ...;

        // option 1: the PrintSink based on the Sink V2 API (recommended);
        // it is constructed directly (no builder); pass true to write to stderr instead of stdout
        PrintSink<String> printSink = new PrintSink<>();
        stream.sinkTo(printSink);

        // option 2: the classic shortcut (still available, may be deprecated in the future)
        // stream.print();
    }
}
Triggering program execution
// start execution
env.execute();
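execute() blocks until the job terminates and returns a JobExecutionResult. If the submitting program should not block, executeAsync() returns a JobClient handle instead; a short sketch (the job name is illustrative):

// blocking submission, with an explicit job name
env.execute("my-job");

// non-blocking submission
// JobClient client = env.executeAsync("my-job");
// client.getJobStatus().thenAccept(status -> System.out.println("status: " + status));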
On Alibaba Cloud Flink, the remaining parameters can be configured on the deployment page.
