Flink in Action: Data Source

As the starting point of a data stream, big-data computation draws on a wide variety of data sources. Besides common ones such as files, databases, message queues, and sockets, Flink also provides a custom-source mechanism that lets users connect to arbitrary data sources.

Built-in Data Sources

Collections

Reading data from collections:

  env.fromElements("apple","orange","banana","peach","grape"); // read from elements; all elements must share the same type
  env.fromCollection(new ArrayList<>()); // read from a Java collection
  env.fromParallelCollection(new NumberSequenceIterator(1, 5), Long.class); // read from a splittable iterator in parallel
  env.generateSequence(0, 100); // build from the given sequence range (deprecated in newer Flink; prefer env.fromSequence)
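Put together, a complete job built on a collection source could look like the sketch below (the class name is illustrative):

```java
import java.util.Arrays;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CollectionSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a bounded stream backed by an in-memory collection; handy for local testing
        env.fromCollection(Arrays.asList("apple", "orange", "banana"))
           .map(String::toUpperCase)
           .print();

        env.execute("collection-source-demo");
    }
}
```

Collection sources are bounded, so the job finishes once the collection is exhausted, which makes them convenient for unit-testing pipeline logic.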

Files

Reading data from files:

source = env.readTextFile("D:\\log.txt"); // read a text file line by line
env.readFile(new TextInputFormat(new Path("D:\\log.txt")),
        "D:\\log.txt",
        FileProcessingMode.PROCESS_ONCE,
        1,
        BasicTypeInfo.STRING_TYPE_INFO).print(); // read from a file with an explicit InputFormat

Socket

Reading data from a socket:

env.socketTextStream("127.0.0.1", 8080, "\n", 5).print(); // socket-based stream: "\n" is the record delimiter, 5 is the max number of retries

Third-Party Data Sources

For common third-party data sources, official connectors are also provided.

Kafka Source

Add the Maven dependency (newer Kafka connector releases drop the Scala suffix from the artifact id):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>3.2.0-1.19</version> <!-- pick the connector version that matches your Flink version -->
</dependency>

Usage:

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers(brokers)
    .setTopics("input-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
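The snippet above only builds the source. A minimal end-to-end sketch, assuming a broker at localhost:9092 and a hypothetical topic named input-topic, might look like:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")        // assumed broker address
                .setTopics("input-topic")                     // assumed topic name
                .setGroupId("my-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // attach the source to the job graph and print each record
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
           .print();

        env.execute("kafka-source-demo");
    }
}
```

This is an unbounded source, so the job keeps running until cancelled.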

MongoDB Connector

Add the Maven dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-mongodb</artifactId>
    <version>1.2.0-1.19</version> <!-- pick the connector version that matches your Flink version -->
</dependency>

Usage:

MongoSource<String> source = MongoSource.<String>builder()
        .setUri("mongodb://user:password@127.0.0.1:27017")
        .setDatabase("my_db")
        .setCollection("my_coll")
        .setProjectedFields("_id", "f0", "f1")
        .setFetchSize(2048)
        .setLimit(10000)
        .setNoCursorTimeout(true)
        .setPartitionStrategy(PartitionStrategy.SAMPLE)
        .setPartitionSize(MemorySize.ofMebiBytes(64))
        .setSamplesPerPartition(10)
        .setDeserializationSchema(new MongoDeserializationSchema<String>() {
            @Override
            public String deserialize(BsonDocument document) {
                return document.toJson();
            }

            @Override
            public TypeInformation<String> getProducedType() {
                return BasicTypeInfo.STRING_TYPE_INFO;
            }
        })
        .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "MongoDB-Source")
        .setParallelism(2)
        .print()
        .setParallelism(1);

RabbitMQ

Add the Maven dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-rabbitmq_2.12</artifactId>
    <version>1.12.2</version> <!-- pick the version that matches your Flink version -->
</dependency>

Connector example from the official documentation:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpointing is required for exactly-once or at-least-once guarantees
env.enableCheckpointing(...);

final RMQConnectionConfig connectionConfig = new RMQConnectionConfig.Builder()
    .setHost("localhost")
    .setPort(5000)
    ...
    .build();
    
final DataStream<String> stream = env
    .addSource(new RMQSource<String>(
        connectionConfig,            // config for the RabbitMQ connection
        "queueName",                 // name of the RabbitMQ queue to consume
        true,                        // use correlation ids; can be false if only at-least-once is required
        new SimpleStringSchema()))   // deserialization schema to turn messages into Java objects
    .setParallelism(1);              // non-parallel source is only required for exactly-once

Databases

Taking MySQL as an example (note that the official snippet below uses JdbcSink, i.e. it writes to MySQL; reading from MySQL is covered in the custom-source section at the end):

<dependencies>
    <!-- Flink JDBC connector dependency -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_2.11</artifactId>
        <version>1.13.0</version> <!-- pick the version that matches your Flink version -->
    </dependency>

    <!-- MySQL JDBC driver dependency -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.23</version> <!-- pick the driver version that matches your MySQL server -->
    </dependency>

    <!-- other dependencies -->
</dependencies>

Example from the official documentation:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
        .fromElements(...)
        .addSink(JdbcSink.sink(
                "insert into books (id, title, author, price, qty) values (?,?,?,?,?)",
                (ps, t) -> {
                    ps.setInt(1, t.id);
                    ps.setString(2, t.title);
                    ps.setString(3, t.author);
                    ps.setDouble(4, t.price);
                    ps.setInt(5, t.qty);
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl(getDbMetadata().getUrl())
                        .withDriverName(getDbMetadata().getDriverClass())
                        .build()));
env.execute();
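The official example above demonstrates a sink. For use as a data source, the JDBC connector also provides JdbcInputFormat, which reads a query result as a bounded stream; a minimal sketch (database URL, credentials, table, and column types are assumptions) might look like:

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.JdbcInputFormat;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JdbcSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        JdbcInputFormat inputFormat = JdbcInputFormat.buildJdbcInputFormat()
                .setDrivername("com.mysql.cj.jdbc.Driver")
                .setDBUrl("jdbc:mysql://127.0.0.1:3306/test")  // assumed database
                .setUsername("root")                           // assumed credentials
                .setPassword("123456")
                .setQuery("select id, title from books")       // assumed table and columns
                .setRowTypeInfo(new RowTypeInfo(               // must match the selected column types
                        BasicTypeInfo.INT_TYPE_INFO,
                        BasicTypeInfo.STRING_TYPE_INFO))
                .finish();

        env.createInput(inputFormat)
           .print();

        env.execute("jdbc-source-demo");
    }
}
```

Because the query runs once and the result set is finite, this behaves as a bounded source; for change-data-capture style continuous reads, a dedicated CDC connector is the usual choice.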

In addition, Flink provides connectors such as Firehose, Elasticsearch, and Cassandra.

Custom Data Sources

Flink's DataStream API lets developers define their own sources as needed. In essence, you define a class that implements the run and cancel methods of the SourceFunction interface; run contains the logic for fetching data. If you need a parallel source, implement the ParallelSourceFunction interface or extend RichParallelSourceFunction instead.
Below is a custom MysqlSource example:

package em.im.cbd.dao;

import em.im.cbd.bean.MsgInfo;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlSource extends RichSourceFunction<MsgInfo> {

    private PreparedStatement ps;
    private Connection connection;
    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<MsgInfo> sourceContext) throws Exception {
        ResultSet resultSet = ps.executeQuery();
        while (isRunning && resultSet.next()) {
            MsgInfo info = new MsgInfo(
                    resultSet.getInt("id"),
                    resultSet.getString("msg"),
                    resultSet.getString("uid"),
                    resultSet.getString("nick_name"),
                    resultSet.getString("ip"),
                    resultSet.getInt("content_length"));
            sourceContext.collect(info); // emit each record via SourceContext#collect into the DataStreamSource
        }
    }

    @Override
    public void cancel() {
        isRunning = false; // stop the read loop when the job is cancelled
    }

    /**
     * Open the connection in open(), so it is not created and released on every invocation.
     *
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        String url = "jdbc:mysql://127.0.0.1:3306/implat_chat?user=root&password=123456&useUnicode=true&characterEncoding=UTF8";
        connection = DriverManager.getConnection(url);
        String sql = "select * from chat_msg";
        ps = this.connection.prepareStatement(sql);
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (ps != null)
            ps.close();
        if (connection != null)
            connection.close();
    }
}
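Wiring the custom source into a job works the same way as with any built-in source; a brief sketch (the MsgInfo bean and the database above are assumed to exist):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // addSource returns a DataStreamSource<MsgInfo> backed by our custom run() loop
        env.addSource(new MySqlSource())
           .print();

        env.execute("mysql-source-demo");
    }
}
```

Since MySqlSource extends RichSourceFunction (not a parallel variant), it always runs with parallelism 1.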
posted @ 2024-07-26 20:32  古法编程