Flink in Practice: Data Source
As the starting point of a data stream, big-data jobs draw on a wide variety of data sources. Besides common sources such as files, databases, message queues, and sockets, Flink also lets users connect to arbitrary systems through custom sources.

Built-in data sources
Collections
Reading data from a collection:
env.fromElements("apple", "orange", "banana", "peach", "grape"); // read from individual elements; all elements must be of the same type
env.fromCollection(new ArrayList<>()); // read from a java.util.Collection
env.fromParallelCollection(new NumberSequenceIterator(1, 5), Long.class); // read from a splittable iterator in parallel
env.fromSequence(0, 100); // build from the given sequence range (replaces the deprecated generateSequence)
Files
Reading data from a file:
DataStreamSource<String> source = env.readTextFile("D:\\log.txt"); // read a text file line by line
env.readFile(new TextInputFormat(new Path("D:\\log.txt")),
        "D:\\log.txt",
        FileProcessingMode.PROCESS_ONCE,
        1,
        BasicTypeInfo.STRING_TYPE_INFO).print(); // read with an explicit InputFormat; PROCESS_ONCE reads the file a single time
Socket
Reading data from a socket:
env.socketTextStream("127.0.0.1", 8080, "\n", 5).print(); // socket-based data stream; records are delimited by \n
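The delimiter argument controls how the raw byte stream is cut into records. The splitting can be sketched in plain Java (no Flink dependencies; this is a conceptual illustration, not Flink's actual implementation, and the class and method names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SocketLineSplitter {

    // Consume a byte stream and split it into records on the given delimiter,
    // mimicking what socketTextStream's delimiter argument does.
    static List<String> readRecords(InputStream in, String delimiter) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        int c;
        while ((c = reader.read()) != -1) {
            buf.append((char) c);
            int idx;
            // a single read may complete several records, so drain the buffer
            while ((idx = buf.indexOf(delimiter)) != -1) {
                records.add(buf.substring(0, idx));
                buf.delete(0, idx + delimiter.length());
            }
        }
        if (buf.length() > 0) {
            records.add(buf.toString()); // trailing partial record at end of stream
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("a\nb\nc".getBytes(StandardCharsets.UTF_8));
        System.out.println(readRecords(in, "\n"));
    }
}
```

The buffer-and-drain loop is why a record split across two network packets is still emitted whole: bytes accumulate until the delimiter shows up.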
Third-party data sources
For common third-party systems, the community also provides dedicated connectors.
Kafka Source
Add the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>3.2.0-1.19</version> <!-- pick the connector version matching your Flink release -->
</dependency>
Usage:
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
MongoDB Connector
Add the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-mongodb</artifactId>
    <version>1.1.2-1.19</version> <!-- pick the connector version matching your Flink release -->
</dependency>
Usage:
MongoSource<String> source = MongoSource.<String>builder()
        .setUri("mongodb://user:password@127.0.0.1:27017")
        .setDatabase("my_db")
        .setCollection("my_coll")
        .setProjectedFields("_id", "f0", "f1")
        .setFetchSize(2048)
        .setLimit(10000)
        .setNoCursorTimeout(true)
        .setPartitionStrategy(PartitionStrategy.SAMPLE)
        .setPartitionSize(MemorySize.ofMebiBytes(64))
        .setSamplesPerPartition(10)
        .setDeserializationSchema(new MongoDeserializationSchema<String>() {
            @Override
            public String deserialize(BsonDocument document) {
                return document.toJson();
            }

            @Override
            public TypeInformation<String> getProducedType() {
                return BasicTypeInfo.STRING_TYPE_INFO;
            }
        })
        .build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "MongoDB-Source")
        .setParallelism(2)
        .print()
        .setParallelism(1);
RabbitMQ
Add the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-rabbitmq_2.12</artifactId>
    <version>1.12.2</version> <!-- pick the version matching your Flink release -->
</dependency>
Example from the official documentation:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpointing is required for exactly-once or at-least-once guarantees
env.enableCheckpointing(...);

final RMQConnectionConfig connectionConfig = new RMQConnectionConfig.Builder()
        .setHost("localhost")
        .setPort(5000)
        ...
        .build();

final DataStream<String> stream = env
        .addSource(new RMQSource<String>(
                connectionConfig,          // config for the RabbitMQ connection
                "queueName",               // name of the RabbitMQ queue to consume
                true,                      // use correlation ids; can be false if only at-least-once is required
                new SimpleStringSchema())) // deserialization schema to turn messages into Java objects
        .setParallelism(1);                // non-parallel source is only required for exactly-once
Database
Take connecting to MySQL as an example:
<dependencies>
    <!-- Flink JDBC connector dependency -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_2.11</artifactId>
        <version>1.13.0</version> <!-- pick the version matching your Flink release -->
    </dependency>
    <!-- MySQL JDBC driver dependency -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.23</version> <!-- pick the driver version matching your MySQL server -->
    </dependency>
    <!-- other dependencies -->
</dependencies>
Example from the official documentation:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
        .fromElements(...)
        .addSink(JdbcSink.sink(
                "insert into books (id, title, author, price, qty) values (?,?,?,?,?)",
                (ps, t) -> {
                    ps.setInt(1, t.id);
                    ps.setString(2, t.title);
                    ps.setString(3, t.author);
                    ps.setDouble(4, t.price);
                    ps.setInt(5, t.qty);
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl(getDbMetadata().getUrl())
                        .withDriverName(getDbMetadata().getDriverClass())
                        .build()));
env.execute();
Beyond these, Flink also ships connectors for systems such as Firehose, Elasticsearch, and Cassandra.
Custom data sources
Flink's DataStream API lets developers define their own sources as needed. In essence, you write a class that implements the run and cancel methods of the SourceFunction interface: run contains the logic that fetches the data, and cancel stops it. If you need a parallel source, implement the ParallelSourceFunction interface or extend RichParallelSourceFunction instead.
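The run/cancel contract typically relies on a volatile flag: run loops while the flag is true, and cancel clears it from another thread. A minimal plain-Java sketch of the pattern (no Flink dependencies; the class name and the bound parameter are illustrative, and the returned list stands in for SourceContext.collect):

```java
import java.util.ArrayList;
import java.util.List;

public class FlagControlledSource {
    // volatile so that a cancel() from another thread is visible to run()'s loop
    private volatile boolean isRunning = true;

    // Emits numbers until cancelled or the bound is reached.
    public List<Long> run(long bound) {
        List<Long> out = new ArrayList<>();
        long next = 0;
        while (isRunning && next < bound) {
            out.add(next++); // a real source would call sourceContext.collect(...) here
        }
        return out;
    }

    public void cancel() {
        isRunning = false; // run()'s loop observes this and exits
    }
}
```

An unbounded SourceFunction whose cancel leaves the flag untouched never terminates, which is why the flag check belongs in the loop condition.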
Below is a custom MysqlSource example:
package em.im.cbd.dao;

import em.im.cbd.bean.MsgInfo;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlSource extends RichSourceFunction<MsgInfo> {
    private volatile boolean isRunning = true;
    private PreparedStatement ps;
    private Connection connection;

    @Override
    public void run(SourceContext<MsgInfo> sourceContext) throws Exception {
        ResultSet resultSet = ps.executeQuery();
        while (isRunning && resultSet.next()) {
            MsgInfo info = new MsgInfo(
                    resultSet.getInt("id"),
                    resultSet.getString("msg"),
                    resultSet.getString("uid"),
                    resultSet.getString("nick_name"),
                    resultSet.getString("ip"),
                    resultSet.getInt("content_length"));
            sourceContext.collect(info); // emit each record via SourceContext.collect; downstream this becomes a DataStreamSource
        }
    }

    @Override
    public void cancel() {
        isRunning = false; // breaks the loop in run()
    }

    /**
     * Open the connection in open() so it is not created and released for every record.
     *
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        String url = "jdbc:mysql://127.0.0.1:3306/implat_chat?user=root&password=123456&useUnicode=true&characterEncoding=UTF8";
        connection = DriverManager.getConnection(url);
        String sql = "select * from chat_msg";
        ps = this.connection.prepareStatement(sql);
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (ps != null)
            ps.close();
        if (connection != null)
            connection.close();
    }
}
