Flink Real-Time Computing Project Summary
0. Project plan
1) Solidify understanding of Flink internals √
2) Set up the real-time computing cluster √
- flume √
- kafka √
- flink √
- hbase √
- mysql √
- hadoop √
3) Write and debug the Flink program locally √
4) Package the Flink job, deploy it to production, and monitor it while it runs √
5) Exercise the fault-tolerance mechanism: when Flume fails to collect data, how to recover, and how downstream computation recovers and stays correct
6) Set up Hadoop and HBase to replace MySQL; use RocksDB as the state backend and inspect the state; store checkpoints on HDFS and inspect them
7) Rewrite the Flink SQL as the Table API, then again as the DataStream API
1. Project architecture
- Data generation: a Python crawler scrapes data and writes it to files
- Data collection: Flume reads the files and ships them to Kafka
- Data processing: Flink processes the Kafka data; MySQL, HBase, or ES stores the dimension tables
- Data output: results are written to ClickHouse, MySQL, Doris, StarRocks, Druid, etc. for visualization
2. Environment setup
2.1 Data preparation
## Generate data with the Python script (a sketch of such a generator follows the commands)
python autodatapython.py D:\wk\personal_secondhands_houses_realtime_anaysis\test01.txt 100
python autodatapython.py D:\wk\personal_secondhands_houses_realtime_anaysis\test02.txt 100
## Note: the file name must change on every run, otherwise Flume reports an error
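The real autodatapython.py is not reproduced here; a minimal sketch of what such a generator might look like, assuming CSV output in the column order of the kafka_source table from section 2.6 (field names, value ranges, and sample codes are assumptions):
import random
import sys
from datetime import date

# Usage: python autodatapython.py <output_file> <row_count>
# Emits CSV rows in the kafka_source column order:
# id, user name, age, gender, goods_no, goods_price, store_id, shopping_type, tel, email, shopping_date
GOODS = ['220902', '430031', '550012', '650012', '532120', '230121']
STORES = ['313012', '313013', '313014', '313015', '313016', '313017']

def make_row(i):
    return ','.join([
        str(i),                                   # id
        'user%04d' % random.randint(1, 9999),     # user name
        str(random.randint(18, 70)),              # age
        random.choice(['M', 'F']),                # gender
        random.choice(GOODS),                     # goods_no
        '%.2f' % random.uniform(10, 500),         # goods_price
        random.choice(STORES),                    # store_id
        random.choice(['buy', 'browse']),         # shopping_type
        '138%08d' % random.randint(0, 99999999),  # tel
        'user%d@example.com' % i,                 # email
        date.today().isoformat(),                 # shopping_date
    ])

if __name__ == '__main__':
    path, count = sys.argv[1], int(sys.argv[2])
    with open(path, 'w') as f:
        for i in range(count):
            f.write(make_row(i) + '\n')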
2.2 Set up and start ZooKeeper
- Set up and start ZooKeeper: https://www.cnblogs.com/fushiyi/articles/18141514 -- Section 1: Deploy the ZooKeeper cluster
2.3 Set up and start Kafka
- Set up and start Kafka: https://www.cnblogs.com/fushiyi/articles/18141514 -- Section 2: Deploy the Kafka cluster
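Once the brokers are up, the topic consumed by the job can be created and checked from the Kafka CLI; a minimal sketch, assuming a single broker on localhost and the topic name real_time used in section 2.6:
## Create the topic read by the Flink job (run from the Kafka install directory)
bin/kafka-topics.sh --create --topic real_time --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
## Sanity check: tail the topic with a console consumer
bin/kafka-console-consumer.sh --topic real_time --bootstrap-server localhost:9092 --from-beginning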
2.4 Set up and start Flume
- Set up and start Flume: https://www.cnblogs.com/fushiyi/articles/18141514 -- Section 3: Deploy Flume
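The linked post has the full steps; as a rough sketch of the agent used here (a spooling-directory source feeding a Kafka sink; the agent/component names, file name, and channel sizing are assumptions):
## conf/spooldir-kafka.conf (sketch)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /tmp/flume_spooldir
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = real_time
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = c1
## Start the agent
bin/flume-ng agent -n a1 -c conf -f conf/spooldir-kafka.conf -Dflume.root.logger=INFO,console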
2.5 Set up MySQL
- Set up the MySQL database: https://www.cnblogs.com/fushiyi/articles/15956398.html -- Section 2.7: Deploy MySQL (host machine)
Initialize the tables:
-- Create the goods info dimension table
DROP TABLE IF EXISTS hcip_goods_info;
CREATE TABLE hcip_goods_info
( goods_no varchar(30) NOT NULL
, goods_name varchar(30) DEFAULT NULL )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Insert sample goods data
insert into hcip_goods_info values('220902','杭州丝绸');
insert into hcip_goods_info values('430031','西湖龙井');
insert into hcip_goods_info values('550012','西湖莼菜');
insert into hcip_goods_info values('650012','张小泉剪刀');
insert into hcip_goods_info values('532120','塘栖枇杷');
insert into hcip_goods_info values('230121','临安山核桃');
insert into hcip_goods_info values('250983','西湖藕粉');
insert into hcip_goods_info values('480071','千岛湖鱼干');
insert into hcip_goods_info values('580016','天尊贡芽');
insert into hcip_goods_info values('950013','叫花童鸡');
insert into hcip_goods_info values('152121','火腿蚕豆');
insert into hcip_goods_info values('230121','杭州百鸟朝凤');
commit;
-- Create the store info dimension table
DROP TABLE IF EXISTS hcip_store_info;
CREATE TABLE hcip_store_info
( store_id varchar(50) NOT NULL
, store_name varchar(50) DEFAULT NULL )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Insert sample store data
INSERT INTO hcip_store_info VALUES('313012','莫干山店');
INSERT INTO hcip_store_info VALUES('313013','定安路店');
INSERT INTO hcip_store_info VALUES('313014','西湖银泰店');
INSERT INTO hcip_store_info VALUES('313015','天目山店');
INSERT INTO hcip_store_info VALUES('313016','凤起路店');
INSERT INTO hcip_store_info VALUES('313017','南山路店');
INSERT INTO hcip_store_info VALUES('313018','西溪湿地店');
INSERT INTO hcip_store_info VALUES('313019','传媒学院店');
INSERT INTO hcip_store_info VALUES('313020','西湖断桥店');
INSERT INTO hcip_store_info VALUES('313021','保淑塔店');
INSERT INTO hcip_store_info VALUES('313022','南宋御街店');
INSERT INTO hcip_store_info VALUES('313023','河坊街店');
commit;
-- Create the total goods sales table
-- goods_amount_count
DROP TABLE IF EXISTS goods_amount_count;
CREATE TABLE goods_amount_count
( amount_total float NOT NULL
, sale_date date PRIMARY KEY )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Create the top-5 stores by total sales table
-- amount_store_rank
DROP TABLE IF EXISTS amount_store_rank;
CREATE TABLE amount_store_rank
( store_id varchar(50) PRIMARY KEY
, store_name varchar(50) DEFAULT NULL
, amount_total float DEFAULT NULL )
ENGINE=InnoDB DEFAULT CHARSET=utf8;
2.6 Set up and start Flink
- Reference: https://www.cnblogs.com/fushiyi/articles/18141514 -- Section 5: Deploy Flink
- Flink SQL client initialization SQL (sql-client-init.sql):
-- Set the result display mode
SET 'sql-client.execution.result-mode' = 'tableau';
-- Kafka source stream
CREATE TABLE kafka_source
( id STRING, use_rname STRING, age INT, gender STRING, goods_no STRING
, goods_price FLOAT, store_id STRING, shopping_type STRING, tel STRING
, email STRING, shopping_date DATE )
WITH ( 'connector'='kafka'
, 'topic'='real_time'
, 'properties.bootstrap.servers'='localhost:9092'
, 'properties.group.id'='test-consumer-group'
, 'scan.startup.mode' = 'latest-offset'
, 'format'='csv');
-- MySQL source 1: store dimension table
CREATE TABLE table_store_info
( store_id STRING
, store_name STRING )
WITH ( 'connector'='jdbc'
, 'username'='root'
, 'password'='123456'
, 'url' = 'jdbc:mysql://localhost:3306/real_time'
, 'table-name'= 'hcip_store_info');
-- MySQL source 2: goods dimension table
CREATE TABLE table_goods_info
( goods_no STRING
, goods_name STRING )
WITH ( 'connector'='jdbc'
, 'username'='root'
, 'password'='123456'
, 'url' = 'jdbc:mysql://localhost:3306/real_time'
, 'table-name'= 'hcip_goods_info');
-- MySQL sink 1: total goods sales (per day)
CREATE TABLE goods_amount_count
( amount_total FLOAT
, sale_date DATE
, PRIMARY KEY (sale_date) NOT ENFORCED )
WITH ( 'connector'='jdbc'
, 'username'='root'
, 'password'='123456'
, 'url' = 'jdbc:mysql://localhost:3306/real_time'
, 'table-name'= 'goods_amount_count');
-- MySQL sink 2: top-5 stores by total sales
CREATE TABLE amount_store_rank
( store_id STRING
, store_name STRING
, amount_total FLOAT
, PRIMARY KEY (store_id) NOT ENFORCED )
WITH ( 'connector'='jdbc'
, 'username'='root'
, 'password'='123456'
, 'url' = 'jdbc:mysql://localhost:3306/real_time'
, 'table-name'= 'amount_store_rank');
-- Insert 1: total goods sales per day
INSERT INTO goods_amount_count
SELECT sum(goods_price) as amount_total
     , shopping_date as sale_date
FROM kafka_source
WHERE shopping_type='buy'
GROUP BY shopping_date;
-- Insert 2: top-5 stores by total sales
INSERT INTO amount_store_rank
SELECT t1.store_id, t2.store_name
     , sum(t1.goods_price) as amount_total
FROM kafka_source t1
LEFT JOIN table_store_info as t2 ON t1.store_id = t2.store_id
WHERE t1.shopping_type='buy'
GROUP BY t1.store_id, t2.store_name;
- Start the client:
./sql-client.sh embedded -i ../conf/sql-client-init.sql ## -s yarn-session
2.7 Set up Hadoop
Reference: https://www.cnblogs.com/fushiyi/articles/18142929 -- set up Hadoop
2.8 Set up HBase
Reference: https://www.cnblogs.com/fushiyi/articles/18143068
2.9 Common problems
2.9.1 Missing dependencies cause missing-class errors
- Fix: download the missing jars and copy them to flinkhome/lib
- Search for jars on Maven Central: https://mvnrepository.com/
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId> <!-- search for the artifact -->
<version>3.1.2-1.17</version> <!-- then pick the version -->
</dependency>
- Flink SQL client dependencies: see the official docs at https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/connectors/table/jdbc/
- Dependencies:
  - Flink reading from Kafka: flink-connector-kafka-1.17.0.jar, flink-connector-base-1.17.0.jar, kafka-clients-3.4.0.jar (copied from kafkahome/lib)
  - Flink reading from MySQL: flink-connector-jdbc-3.1.0-1.17.jar, mysql-connector-java-5.1.47.jar (the Flink cluster must be restarted after adding them)
- Error messages:
  - org.apache.kafka.clients.consumer.OffsetResetStrategy -- missing the kafka-clients jar or the flink-connector-kafka jar
  - Could not find any factory for identifier 'jdbc' -- missing flink-connector-jdbc-3.1.0-1.17.jar or mysql-connector-java-5.1.47.jar
2.9.2 The Flink cluster was not restarted
- Fix: restart the Flink cluster
- java.net.ConnectException: Connection refused -- the Flink cluster is not running
- Configuration changes not taking effect -- the cluster was not restarted after the change
2.9.3 Flume does not keep collecting changed data
- Problem: Flume does not collect continuously; each run of the generator leaves a .txt.COMPLETED file, and the same file cannot be processed again
- Fix: never reuse a file name when generating data with Python
python autodatapython.py /tmp/flume_spooldir/test01.txt 10
python autodatapython.py /tmp/flume_spooldir/test02.txt 10
python autodatapython.py /tmp/flume_spooldir/test03.txt 10
python autodatapython.py /tmp/flume_spooldir/test04.txt 10
- Analysis: the spooling-directory source only picks up new files; it does not watch for modifications inside a file it has already collected
2.9.4 Reading the source or writing the sink fails
- Problem: data types do not match
  Query schema: [store_id: STRING, store_name: STRING, amount_total: FLOAT]
  Sink schema:  [store_id: INT, store_name: STRING]
  Fix: change the column types of the corresponding table
- Problem: the sink table DDL is wrong
  Unsupported options: primary-key
  Fix: follow the sink-table DDL in the initialization SQL and the official docs: https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/connectors/table/jdbc/
- Problem: not enough resources
  Could not acquire the minimum required resources.
  Fix: edit conf/flink-conf.yaml, e.g. jobmanager.memory.process.size: 16000m, then restart the cluster (see the config sketch below)
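A minimal flink-conf.yaml sketch for the memory-related keys; the sizes below are assumptions and should be adapted to the machine:
## conf/flink-conf.yaml (example values)
jobmanager.memory.process.size: 16000m
taskmanager.memory.process.size: 4096m
taskmanager.numberOfTaskSlots: 2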
3. Flink development, packaging, and deployment
3.1 Create the project
## Pick your own JDK directory for the project JDK
3.2 pom configuration
Reference: https://gitee.com/fubob/personal_secondhands_houses_realtime_anaysis/blob/master/real_time01/pom.xml
3.3 Flink development
Reference: https://gitee.com/fubob/personal_secondhands_houses_realtime_anaysis/tree/master/real_time01 (a minimal sketch of the job shape follows)
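The actual code lives in the repository above; as a rough idea of its shape, a minimal Table API job that registers the same Kafka source and JDBC sink as the SQL client scripts and runs the daily-sales aggregation (class and package names are placeholders, not the repo's real ones):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class RealTimeJobSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Kafka source table, same DDL as sql-client-init.sql
        tEnv.executeSql(
            "CREATE TABLE kafka_source (" +
            "  id STRING, use_rname STRING, age INT, gender STRING, goods_no STRING," +
            "  goods_price FLOAT, store_id STRING, shopping_type STRING, tel STRING," +
            "  email STRING, shopping_date DATE" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'real_time'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'properties.group.id' = 'test-consumer-group'," +
            "  'scan.startup.mode' = 'latest-offset', 'format' = 'csv')");

        // JDBC sink table for the per-day sales total
        tEnv.executeSql(
            "CREATE TABLE goods_amount_count (" +
            "  amount_total FLOAT, sale_date DATE, PRIMARY KEY (sale_date) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc', 'url' = 'jdbc:mysql://localhost:3306/real_time'," +
            "  'username' = 'root', 'password' = '123456'," +
            "  'table-name' = 'goods_amount_count')");

        // Continuous aggregation, identical to the INSERT used in the SQL client
        tEnv.executeSql(
            "INSERT INTO goods_amount_count " +
            "SELECT SUM(goods_price) AS amount_total, shopping_date AS sale_date " +
            "FROM kafka_source WHERE shopping_type = 'buy' GROUP BY shopping_date");
    }
}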
3.4 Packaging the Flink job
3.4.1 Dependency comparison
## 1. Check that the versions in the pom match the jars under /opt/hadoop/flink-1.17.0/lib
## 2. Adjust the scope of each dependency accordingly (see the example below)
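For example, anything already shipped in the cluster's lib directory can be marked provided so it is not bundled into the fat jar; a sketch (the artifact and version here are illustrative):
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>1.17.0</version>
<!-- already present in flink-1.17.0/lib, so do not package it into the jar -->
<scope>provided</scope>
</dependency>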
3.4.2 Packaging with plugins
- Add the maven-compiler-plugin and maven-shade-plugin to the pom
<build>
<!-- Final name of the packaged jar -->
<finalName>test</finalName>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<!-- JDK version for compilation; if unset, Maven 3 defaults to JDK 1.5 and Maven 2 to JDK 1.3 -->
<source>1.8</source> <!-- JDK version of the source code -->
<target>1.8</target> <!-- bytecode version of the generated class files -->
<encoding>UTF-8</encoding><!-- source file encoding -->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<!-- Main class of the jar -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.cw.HelloStart</mainClass>
</transformer>
<!-- The following is only needed when several jars contain resource files with the same name -->
<!-- e.g. properties files packaged under identical paths -->
<!-- to avoid one overwriting another, their contents are appended into a single file -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/spring.handlers</resource>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/spring.schemas</resource>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
- Run mvn clean compile package
- Check the generated jar
3.5 Restart the cluster
Reference: https://www.cnblogs.com/fushiyi/articles/18141549 -- Section 1: Restart the cluster
3.6 Deploy the Flink job
**Upload the jar to the Flink cluster**
3.7 Run the Flink job
./flink run -c org.example.real_time_v20240412 real_time/real_time.jar
# Run in the background
nohup ./flink run -c org.example.real_time_v20240412 real_time/real_time.jar &
3.8 Deploy the data-generation script
#!/bin/bash
# Generate a new file every 60 seconds (test300.txt ... test1000.txt)
for ((i=300; i<=1000; i++))
do
  pythonScript="python /opt/hadoop/autodatapython.py /tmp/flume_spooldir/test$i.txt 10"
  $pythonScript
  sleep 60
done
## Run the script in the background
nohup ./dataGenerate.sh &
4. Flink monitoring
4.1 Monitoring UI
## URL: http://192.168.0.104:8081/#/job/running
4.1.1 Overview page
4.1.2 Backpressure tracking⭐
4.1.3 Data skew tracking⭐
4.2 Flink fault tolerance and recovery⭐
Reference: https://www.cnblogs.com/fushiyi/articles/18152410 -- Section 2: Fault-tolerant recovery (a checkpointing sketch follows)
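The linked post covers the recovery drills; for reference, a minimal sketch of how the checkpointing, RocksDB state backend, and HDFS checkpoint storage from the project plan could be wired into the job (the interval, HDFS path, and retention policy are assumptions):
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an exactly-once checkpoint every 60 seconds
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep operator state in RocksDB, with incremental checkpoints enabled
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Persist checkpoints to HDFS so the job can be restored after a failure
        checkpointConfig.setCheckpointStorage("hdfs://localhost:9000/flink/checkpoints");
        // Keep the last checkpoint even when the job is cancelled manually
        checkpointConfig.setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the pipeline here, then call env.execute(...)
    }
}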
4.3 Common problems⭐⭐
- Problem: a duplicate file causes a Flume error and the Flink job crashes
- Fix: remove the offending source files, restart the Flume agent, then restart the Flink job
- Backpressure⭐⭐: https://www.cnblogs.com/fushiyi/articles/18152410 -- Section 1.2.1: Backpressure
- Data skew⭐⭐: https://www.cnblogs.com/fushiyi/articles/18152410 -- Section 1.2.2: Data skew