Flink Real-Time Computing Project Summary

0. Project Plan

1) Solidify understanding of Flink fundamentals √

2) Set up the real-time computing cluster √

  • flume √
  • kafka √
  • flink √
  • hbase √
  • mysql √
  • hadoop √

3) Write and debug the Flink program locally √

4) Package the Flink program, deploy it to production, and monitor the running job √

5) Test fault tolerance: after Flume fails to collect data, how to recover, and how downstream computation recovers while staying correct

6) Set up Hadoop and HBase to replace MySQL; use RocksDB to store state and inspect the data; store checkpoints on HDFS and inspect them

7) Rewrite the Flink SQL job with the Table API, then with the DataStream API

1. Project Architecture

  • Data generation: a Python crawler scrapes data and writes it to files
  • Data collection: Flume reads the files and ships the records to Kafka
  • Data processing: Flink processes the Kafka data; MySQL, HBase, or ES stores the dimension tables
  • Data output: results are written to ClickHouse, MySQL, Doris, StarRocks, Druid, etc. for visualization

image-20240423101601909

2. Environment Setup

2.1 Data preparation

## Generate data with the Python script
python autodatapython.py D:\wk\personal_secondhands_houses_realtime_anaysis\test01.txt 100
python autodatapython.py D:\wk\personal_secondhands_houses_realtime_anaysis\test02.txt 100

## Note: use a new file name on every run, otherwise Flume reports an error

2.2 Set up and start ZooKeeper

2.3 Set up and start Kafka

2.4 Set up and start Flume

2.5 Set up MySQL

Initialize the tables:

-- Create the goods info dimension table
DROP TABLE IF EXISTS hcip_goods_info;
CREATE TABLE hcip_goods_info
( goods_no varchar(30) NOT NULL
, goods_name varchar(30) DEFAULT NULL ) 
ENGINE=InnoDB 
DEFAULT
CHARSET=utf8; 

-- Insert sample goods info data
insert into hcip_goods_info values('220902','杭州丝绸');
insert into hcip_goods_info values('430031','西湖龙井');
insert into hcip_goods_info values('550012','西湖莼菜');
insert into hcip_goods_info values('650012','张小泉剪刀');
insert into hcip_goods_info values('532120','塘栖枇杷');
insert into hcip_goods_info values('230121','临安山核桃');
insert into hcip_goods_info values('250983','西湖藕粉');
insert into hcip_goods_info values('480071','千岛湖鱼干');
insert into hcip_goods_info values('580016','天尊贡芽');
insert into hcip_goods_info values('950013','叫花童鸡');
insert into hcip_goods_info values('152121','火腿蚕豆');
insert into hcip_goods_info values('230121','杭州百鸟朝凤');
commit;
-- Create the store info dimension table
DROP TABLE IF EXISTS hcip_store_info; 
CREATE TABLE hcip_store_info
(store_id varchar(50) NOT NULL
,store_name varchar(50) DEFAULT NULL)
ENGINE=InnoDB 
DEFAULT 
CHARSET=utf8; 

-- Insert sample store info data
INSERT INTO hcip_store_info VALUES('313012','莫干山店'); 
INSERT INTO hcip_store_info VALUES('313013','定安路店');  
INSERT INTO hcip_store_info VALUES('313014','西湖银泰店');  
INSERT INTO hcip_store_info VALUES('313015','天目山店');  
INSERT INTO hcip_store_info VALUES('313016','凤起路店');  
INSERT INTO hcip_store_info VALUES('313017','南山路店');  
INSERT INTO hcip_store_info VALUES('313018','西溪湿地店');  
INSERT INTO hcip_store_info VALUES('33019' ,'传媒学院店');  
INSERT INTO hcip_store_info VALUES('313020','西湖断桥店');  
INSERT INTO hcip_store_info VALUES('313021','保淑塔店');  
INSERT INTO hcip_store_info VALUES('313022','南宋御街店');  
INSERT INTO hcip_store_info VALUES('313023','河坊街店'); 
commit;
-- Create the total goods sales table
-- goods_amount_count 
DROP TABLE IF EXISTS goods_amount_count; 
CREATE TABLE goods_amount_count
(amount_total float NOT NULL 
,sale_date date PRIMARY KEY) 
ENGINE=InnoDB 
DEFAULT 
CHARSET=utf8;

-- Create the top-5 stores by total sales ranking table
-- amount_store_rank 
DROP TABLE IF EXISTS amount_store_rank; 
CREATE TABLE amount_store_rank 
( store_id varchar(50) PRIMARY KEY
,store_name varchar(50) DEFAULT NULL
,amount_total float DEFAULT NULL )
ENGINE=InnoDB 
DEFAULT 
CHARSET=utf8; 
2.6 Set up Flink: SQL client init script (sql-client-init.sql)

-- Set the result display mode
SET 'sql-client.execution.result-mode' = 'tableau'; 

-- Kafka source stream --
CREATE TABLE kafka_source 
(id STRING,use_rname STRING, age int,gender STRING,goods_no STRING
,goods_price Float,store_id STRING,shopping_type STRING,tel STRING
,email STRING,shopping_date Date)
WITH ('connector'='kafka'
,'topic'='real_time'
,'properties.bootstrap.servers'='localhost:9092'
,'properties.group.id'='test-consumer-group'
,'scan.startup.mode' = 'latest-offset'								
,'format'='csv'); 

-- MySQL source: 1. store dimension table --
CREATE TABLE table_store_info
( store_id STRING,
store_name STRING)
WITH( 
'connector'='jdbc'
,'username'='root'
,'password'='123456'
,'url' = 'jdbc:mysql://localhost:3306/real_time'
,'table-name'= 'hcip_store_info'); 

-- MySQL source: 2. goods dimension table --
CREATE TABLE table_goods_info 
( goods_no STRING,
goods_name STRING )
WITH( 
'connector'='jdbc'
,'username'='root'
,'password'='123456'
,'url' = 'jdbc:mysql://localhost:3306/real_time'
,'table-name'= 'hcip_goods_info'); 


-- MySQL sink: 1. total goods sales (per day) --
CREATE TABLE goods_amount_count
(amount_total Float
,sale_date date
,PRIMARY KEY (sale_date) NOT ENFORCED)
WITH (
'connector'='jdbc'
,'username'='root'
,'password'='123456'
,'url' = 'jdbc:mysql://localhost:3306/real_time'
,'table-name'= 'goods_amount_count'); 

-- MySQL sink: 2. top-5 stores by total sales --
CREATE TABLE amount_store_rank
(store_id STRING
,store_name STRING
,amount_total Float
,PRIMARY KEY (store_id) NOT ENFORCED)
WITH (
'connector'='jdbc'
,'username'='root'
,'password'='123456'
,'url' = 'jdbc:mysql://localhost:3306/real_time'
,'table-name'= 'amount_store_rank'); 

-- Insert 1: total goods sales --
INSERT INTO goods_amount_count
SELECT sum(goods_price) as amount_total
,shopping_date as sale_date 
FROM kafka_source 
WHERE shopping_type='buy' 
group BY shopping_date;


-- Insert 2: top-5 stores by total sales --
INSERT INTO amount_store_rank
SELECT t1.store_id,t2.store_name
,sum(t1.goods_price) as amount_total 
FROM kafka_source t1 
left join table_store_info as t2 on t1.store_id=t2.store_id 
WHERE t1.shopping_type='buy' 
group by t1.store_id,t2.store_name;
  • Start the SQL client: ./sql-client.sh embedded -i ../conf/sql-client-init.sql   ## optionally add -s yarn-session
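
For plan item 7, the first INSERT above can also be written with the Table API. This is only a minimal sketch: it assumes the kafka_source and goods_amount_count tables are registered with the same DDLs as in the init script, and the class name is illustrative.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;

public class GoodsAmountTableApiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register the same source/sink tables as in sql-client-init.sql, e.g.:
        // tEnv.executeSql("CREATE TABLE kafka_source (...) WITH ('connector'='kafka', ...)");
        // tEnv.executeSql("CREATE TABLE goods_amount_count (...) WITH ('connector'='jdbc', ...)");

        // Same logic as: INSERT INTO goods_amount_count SELECT sum(goods_price), shopping_date ... GROUP BY shopping_date
        Table result = tEnv.from("kafka_source")
                .where($("shopping_type").isEqual("buy"))
                .groupBy($("shopping_date"))
                .select($("goods_price").sum().as("amount_total"),
                        $("shopping_date").as("sale_date"));

        result.executeInsert("goods_amount_count");
    }
}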

2.7 Set up Hadoop

Reference: https://www.cnblogs.com/fushiyi/articles/18142929 -- Hadoop setup

2.8 Set up HBase

Reference: https://www.cnblogs.com/fushiyi/articles/18143068

2.9 Common issues

2.9.1 Missing dependencies lead to missing classes

  • Fix: download the missing jar and copy it to flinkhome/lib

  • Search for jars on Maven Repository: https://mvnrepository.com/

  • <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc</artifactId>   <!-- search by artifact id -->
            <version>3.1.2-1.17</version>                   <!-- then pick the matching version -->
          </dependency>
      
  • Flink SQL client dependencies: see the official docs at https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/connectors/table/jdbc/

  • Dependencies:
    Flink reading from Kafka: flink-connector-kafka-1.17.0.jar, flink-connector-base-1.17.0.jar, kafka-clients-3.4.0.jar (copied from the Kafka lib directory)
    Flink reading from MySQL: flink-connector-jdbc-3.1.0-1.17.jar, mysql-connector-java-5.1.47.jar (restart the Flink cluster after adding them for the change to take effect)
    
  • Typical errors:
    org.apache.kafka.clients.consumer.OffsetResetStrategy   -- missing kafka-clients jar or flink-connector-kafka jar
    Could not find any factory for identifier 'jdbc'         -- missing flink-connector-jdbc-3.1.0-1.17.jar or mysql-connector-java-5.1.47.jar
    

2.9.2 Flink cluster not (re)started

  • Fix: restart the Flink cluster
java.net.ConnectException: Connection refused           -- the Flink cluster is not running
Configuration changes not taking effect                  -- the Flink cluster was not restarted after the change

2.9.3 Flume does not keep collecting changed data

  • Problem: Flume does not collect continuously; each processed file is renamed to .txt.COMPLETED and is not read again

    • Fix: the Python generator must use a new file name on each run

    • python autodatapython.py /tmp/flume_spooldir/test01.txt 10
      python autodatapython.py /tmp/flume_spooldir/test02.txt 10
      python autodatapython.py /tmp/flume_spooldir/test03.txt 10
      python autodatapython.py /tmp/flume_spooldir/test04.txt 10
      
    • Analysis: the spooling-directory source only watches for new files; it does not pick up modifications to the contents of an existing file

2.9.4 Failures reading the source or writing the sink

  • Problem: data type mismatch

    • Query schema: [store_id: STRING, store_name: STRING, amount_total: FLOAT]
      Sink schema:  [store_id: INT, store_name: STRING]
      
    • Fix: change the sink table so its columns and types match the query schema

  • Problem: the sink table was created incorrectly when writing data out

  • Problem: insufficient resources

    • Could not acquire the minimum required resources.
      
    • Fix: edit conf/flink-conf.yaml, e.g. jobmanager.memory.process.size: 16000m, then restart the cluster

3. Flink Programming, Packaging, and Deployment

3.1 Create the project

## Choose your own JDK directory

image-20240412164623618

3.2 pom file configuration

Reference: https://gitee.com/fubob/personal_secondhands_houses_realtime_anaysis/blob/master/real_time01/pom.xml

3.3 Flink programming

Reference: https://gitee.com/fubob/personal_secondhands_houses_realtime_anaysis/tree/master/real_time01
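
The repository above is the actual implementation. For orientation only, here is a minimal DataStream sketch of the same kind of job (Kafka CSV in, per-store sales total out). The class name, field indexes, and print sink are assumptions based on the kafka_source DDL in section 2.6, not the repository's real code; the real job writes to MySQL/HBase instead of printing.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RealTimeJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Same topic / broker / consumer group as the kafka_source DDL in section 2.6
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("real_time")
                .setGroupId("test-consumer-group")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka_source")
                // CSV fields follow the DDL order: ..., goods_price = f[5], store_id = f[6], shopping_type = f[7]
                .flatMap((String line, Collector<Tuple2<String, Float>> out) -> {
                    String[] f = line.split(",");
                    if (f.length >= 11 && "buy".equals(f[7])) {
                        out.collect(Tuple2.of(f[6], Float.parseFloat(f[5])));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.FLOAT))
                .keyBy(t -> t.f0)      // key by store_id
                .sum(1)                // running sales total per store
                .print();              // the real job uses a JDBC/HBase sink here

        env.execute("real_time sketch");
    }
}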

3.4 Packaging the Flink job

3.4.1 Compare dependencies

## 1. Check that the pom dependency versions match the jars under /opt/hadoop/flink-1.17.0/lib
## 2. Adjust dependency scopes as needed (e.g. provided for jars already present in flink/lib)

3.4.2 Package with plugins

  1. Add the maven-compiler-plugin and maven-shade-plugin plugins to the pom file
    <build>
        <!-- Final name of the packaged artifact -->
        <finalName>test</finalName>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <!-- JDK version used by the Maven compiler; if omitted, Maven 3 defaults to JDK 1.5 and Maven 2 to JDK 1.3 -->
                    <source>1.8</source> <!-- JDK version of the source code -->
                    <target>1.8</target> <!-- bytecode version of the generated class files -->
                    <encoding>UTF-8</encoding><!-- source encoding -->
                </configuration>
            </plugin>
 
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- Main class recorded in the jar manifest -->
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.cw.HelloStart</mainClass>
                                </transformer>
 
                                <!-- The transformers below are only needed when multiple jars contain resource files with the same name -->
                                <!-- e.g. several dependencies may ship identically named property files -->
                                <!-- to avoid one overwriting another, merge their contents into a single file -->
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.handlers</resource>
                                </transformer>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.schemas</resource>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
  2. Run mvn clean compile package

image-20240416164246212

  3. Check the generated jar

img

3.5 Cluster restart

Reference: https://www.cnblogs.com/fushiyi/articles/18141549 -- 1. Cluster restart

3.6 Flink deployment

**Upload the jar to the Flink cluster**

3.7 Running the Flink job

./flink run -c org.example.real_time_v20240412 real_time/real_time.jar

# Run in the background
nohup ./flink run -c org.example.real_time_v20240412 real_time/real_time.jar &

3.8 Deploying the data-generation script

#!/bin/bash  

# Generate a new uniquely named file every 60 seconds (test300.txt ... test1000.txt)
for ((i=300; i<=1000; i++))
do
    pythonScript="python /opt/hadoop/autodatapython.py  /tmp/flume_spooldir/test$i.txt 10"
    $pythonScript
    sleep 60
done
## Run the script in the background
nohup ./dataGenerate.sh &

4. Flink Monitoring

4.1 Web UI

## URL: http://192.168.0.104:8081/#/job/running

4.1.1 Overview page

image-20240416164633782

4.1.2 Backpressure tracking ⭐

image-20240417094432935

4.1.3 Data skew tracking ⭐

image-20240417100809703

4.2 Flink fault tolerance and recovery ⭐

Reference: https://www.cnblogs.com/fushiyi/articles/18152410 -- 2. Fault tolerance and recovery
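
The linked post covers the recovery experiments. For context, a minimal checkpoint and state-backend setup along the lines of plan item 6 could look like the sketch below; the interval and HDFS path are illustrative assumptions, and RocksDB requires the flink-statebackend-rocksdb dependency.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000);                         // checkpoint every 60 s
        env.setStateBackend(new EmbeddedRocksDBStateBackend());  // keep operator state in RocksDB
        env.getCheckpointConfig()
           .setCheckpointStorage("hdfs:///flink/checkpoints");   // persist checkpoints to HDFS (illustrative path)

        // ... build the job as usual, then env.execute(...)
    }
}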

4.3 Common issues ⭐⭐

- Problem: duplicate text/files cause a Flume error and crash the Flink job
- Fix: remove the source files, restart the Flume agent, and restart the Flink job
Backpressure ⭐⭐: https://www.cnblogs.com/fushiyi/articles/18152410 -- 1.2.1 Backpressure
Data skew ⭐⭐: https://www.cnblogs.com/fushiyi/articles/18152410 -- 1.2.2 Data skew
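
For the skewed GROUP BY keys in the SQL jobs above (e.g. a hot store_id), a commonly suggested mitigation is mini-batch plus two-phase (local/global) aggregation. The sketch below only shows those tuning knobs (option names from Flink's group-aggregation tuning docs); the values are illustrative assumptions, not settings verified against this project.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SkewTuningSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        Configuration conf = tEnv.getConfig().getConfiguration();

        // Mini-batch aggregation: buffer records briefly and aggregate them in small batches
        conf.setString("table.exec.mini-batch.enabled", "true");
        conf.setString("table.exec.mini-batch.allow-latency", "2 s");
        conf.setString("table.exec.mini-batch.size", "5000");

        // Two-phase aggregation: local pre-aggregation before the final keyed aggregation
        conf.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");

        // ... register tables and run the same INSERT INTO statements as in section 2.6
    }
}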