
Chapter 1 DWM Layer and DWS Design

1.1 Design Approach

  DWM (Data Warehouse Middle) is commonly called the data middle layer. Building on the DWD layer, it performs light aggregations on the data and produces a set of intermediate tables, improving the reusability of common metrics and reducing duplicated processing. Intuitively, it aggregates along the common core dimensions and computes the corresponding statistics.

  Earlier, by splitting the stream, we broke the data out into separate Kafka topics. The next question is how to process that data, which means deciding which metrics we actually need to compute in real time. Unlike offline processing, real-time computation has very high development and operations costs, so we must consider, based on the actual situation, whether it is necessary to build a large, all-encompassing middle layer the way an offline warehouse does.

  If such a full-blown middle layer is not necessary, we should roughly plan the metrics that need to be computed in real time. Outputting these metrics as theme-oriented wide tables gives us the DWS layer.

1.2 Requirements Overview

| Theme | Metric | Output | Computation source | Source layer |
| --- | --- | --- | --- | --- |
| Visitor | pv | Visualization dashboard | Directly from page_log | dwd |
| Visitor | uv | Visualization dashboard | Requires dedup/filtering of page_log | dwm |
| Visitor | Bounce rate | Visualization dashboard | Judged from page_log behavior | dwm |
| Visitor | Pages visited per session | Visualization dashboard | Requires identifying the visit-start marker | dwd |
| Visitor | Session duration | Visualization dashboard | Requires identifying the visit-start marker | dwd |
| Product | Clicks | Multidimensional analysis | Directly from page_log | dwd |
| Product | Favorites | Multidimensional analysis | Favorites table | dwd |
| Product | Add to cart | Multidimensional analysis | Cart table | dwd |
| Product | Orders | Visualization dashboard | Order wide table | dwm |
| Product | Payments | Multidimensional analysis | Payment wide table | dwm |
| Product | Refunds | Multidimensional analysis | Refund table | dwd |
| Product | Reviews | Multidimensional analysis | Review table | dwd |
| Region | pv | Multidimensional analysis | Directly from page_log | dwd |
| Region | uv | Multidimensional analysis | Requires dedup/filtering of page_log | dwm |
| Region | Orders | Visualization dashboard | Order wide table | dwm |
| Keyword | Search keywords | Visualization dashboard | Directly from the page view log | dwd |
| Keyword | Clicked-product keywords | Visualization dashboard | Re-aggregation of the product theme's orders | dws |
| Keyword | Ordered-product keywords | Visualization dashboard | Re-aggregation of the product theme's orders | dws |

  Of course there will be more requirements in practice; here we mainly target real-time computation for the visualization dashboard. So what is the DWM layer's role? The DWM layer mainly serves DWS: for some requirements, going straight from the DWD layer to the DWS layer involves a fair amount of computation, and the results of that computation are likely to be reused by several DWS themes, so part of DWD is materialized into a DWM layer. The business covered here includes:

  1) Visitor UV computation

  2) Bounce detail computation

  3) Order wide table

  4) Payment wide table

Chapter 2 DWM Layer: UV Computation

2.1 Requirements Analysis and Approach

  UV stands for Unique Visitor. In real-time computation it can also be called DAU (Daily Active User), because the UV in real-time computation usually means the number of distinct visitors for the current day.

  So how do we identify the day's visitors from the user behavior log? There are three points:

    First, identify the first page a visitor opens, which marks the visitor entering the application.

    Second, since a visitor can enter the application many times in one day, deduplicate within the scope of a single day.

    Third, on the next day the same user must be able to count toward UV again.

2.2 Deduplication Logic

  1) Use event-time semantics (to handle out-of-order data)

  2) Key by mid

  3) Add a window

  4) Filter out the first visit of the day (deduplication)

  5) Use Flink state, and keep it for one day only. When should the state be cleared? When the current date no longer matches the date saved in the state!

  6) Write the day's first-visit records to the DWM layer (Kafka)

2.3 Implementation Code

  1) Add a constant to Constant

public static final String TOPIC_DWM_UV = "dwm_uv";

  2)DwmUvApp

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONAware;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV1;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.FlinkSinkUtil;
import com.yuange.flinkrealtime.util.YuangeCommonUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.time.Duration;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/**
 * @作者:袁哥
 * @时间:2021/8/2 8:48
 */
public class DwmUvApp extends BaseAppV1 {

    public static void main(String[] args) {
        new DwmUvApp().init(
                3001,
                1,
                "DwmUvApp",
                "DwmUvApp",
                Constant.TOPIC_DWD_PAGE //从page日志主题中读取数据
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, DataStreamSource<String> sourceStream) {
        sourceStream
                .map(JSON::parseObject)    //将数据转化为JSON格式
                .assignTimestampsAndWatermarks( //添加水印
                        WatermarkStrategy
                                .<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((log, ts) -> log.getLong("ts"))
                )
                .keyBy(obj -> obj.getJSONObject("common").getString("mid")) //按设备id分组
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))   //开一个5秒的滚动窗口
                .process(new ProcessWindowFunction<JSONObject, JSONObject, String, TimeWindow>() {
                    ValueState<Long> firstVisitState;
                    SimpleDateFormat simpleDateFormat;
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        firstVisitState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("firstVisitState", Long.class));
                        simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
                    }

                    @Override
                    public void process(String s, Context context, Iterable<JSONObject> elements, Collector<JSONObject> out) throws Exception {
                        //当第二天的时候, uv重新开启新的去重
                        String today = simpleDateFormat.format(context.window().getEnd());  //以窗口关闭时间作为今天
                        String yesterday = simpleDateFormat.format(firstVisitState.value() == null ? 0L : firstVisitState.value()); //若firstVisitState状态为null,说明用户今天还没有访问

                        if (!today.equals(yesterday)){
                            firstVisitState.clear();
                        }

                        if (firstVisitState.value() == null){
                            List<JSONObject> list = YuangeCommonUtil.toList(elements);
                            JSONObject min = Collections.min(list, Comparator.comparing(o -> o.getLong("ts"))); //排序,然后取时间最早的那条记录
                            out.collect(min);
                            firstVisitState.update(min.getLong("ts"));
                        }
                    }
                })
                .map(JSONAware::toJSONString)
                .addSink(FlinkSinkUtil.getKafkaSink(Constant.TOPIC_DWM_UV));
    }
}
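
  Note: DwmUvApp above uses YuangeCommonUtil.toList to copy the window's elements into a List before calling Collections.min. That helper was written in an earlier chapter and is not repeated here; a minimal sketch of what it is assumed to do:

package com.yuange.flinkrealtime.util;

import java.util.ArrayList;
import java.util.List;

public class YuangeCommonUtil {
    // Sketch only (the actual utility comes from an earlier chapter):
    // copy an Iterable into a List so it can be sorted or passed to Collections.min.
    public static <T> List<T> toList(Iterable<T> it) {
        List<T> list = new ArrayList<>();
        for (T t : it) {
            list.add(t);
        }
        return list;
    }
}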

2.4 Testing Whether Data Reaches the DWM Layer

  1) Start the Hadoop cluster

hadoop.sh start

  2) Start Zookeeper

zk start

  3) Start Kafka

kafka.sh start

  4) Start the log server (previously written into the log-lg.sh script)

log-lg.sh start

  5) Start Flink in yarn-session mode

/opt/module/flink-yarn/bin/yarn-session.sh -d

  6) Modify the realtime.sh script written earlier

#!/bin/bash
flink=/opt/module/flink-yarn/bin/flink
jar=/opt/module/applog/flink-realtime-1.0-SNAPSHOT.jar

apps=(
        com.yuange.flinkrealtime.app.dwd.DwdLogApp
        com.yuange.flinkrealtime.app.dwd.DwdDbApp
        com.yuange.flinkrealtime.app.dwm.DwmUvApp
)

for app in ${apps[*]} ; do
        $flink run -d -c $app $jar
done

  7) Run the script to submit the Flink jobs to the yarn-session (first upload the packaged jar to the expected location)

  8) Once the jobs are running, generate log data to simulate new traffic

cd /opt/software/mock/mock_log
java -jar gmall2020-mock-log-2020-12-18.jar

  9) You can also start a consumer to read the dwm_uv topic

consume dwm_uv

Chapter 3 DWM Layer: Bounce Details

3.1 Requirements Analysis and Approach

3.1.1 What Is a Bounce

  A bounce is when a user lands on an entry page of the site (for example the home page) and then leaves without visiting any other page. Bounce rate formula: bounce rate = number of visits that leave after a single page / total number of visits. For example, if 300 out of 1,000 visits never go beyond the first page, the bounce rate is 30%.

  Watching the bounce rate of the keywords that bring users in shows how well users accept the site's content, in other words whether the site is attractive to them. Whether the content actually helps and retains users also shows up directly in the bounce rate, so it is an important measure of content quality.

  Tracking the bounce rate shows whether visitors brought in by a campaign are engaged quickly, allows comparing the quality of users from different acquisition channels, and comparing the bounce rate before and after optimizing the application shows the effect of the improvements.

3.1.2 Approach to Computing the Bounce Rate

  First we need to recognize the bounce behavior, that is, identify the last page visited by each bouncing visitor. There are a few characteristics to look for:

  1) The page is the first page of the user's recent visit: this can be judged by whether the page has a previous page (last_page_id). If that field is empty, the page is the first one of this visit.

  2) For a fairly long time after that first visit (configurable), the user visits no other page.

    This second judgment is a bit tricky. It cannot be made from a single record; it requires combining a record that exists with one that does not, that is, deriving a conclusion about an existing record from the absence of another. Worse, the other record is not absent forever, only absent within a certain time range. So how do we recognize a combined behavior with a time bound?

    The simplest way is Flink's built-in CEP. CEP is very well suited to recognizing an event from a combination of several records.

  3) A user bounce event is, in essence, the combination of a condition event and a timeout event.

3.1.3 Implementation Code

  1) Make sure the CEP dependency has been added

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep_2.12</artifactId>
    <version>1.13.1</version>
</dependency>

  2) Use event time

  3) Key by mid: all of a user's behaviors are evaluated per mid

  4) Define the pattern: a first entry followed, within a timeout (30 s in the design; the code in this section uses 5 s for testing), by one or more further page visits

  5) The records that time out are exactly the ones we want

  6) Test version of the code

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV1;
import com.yuange.flinkrealtime.common.Constant;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.PatternTimeoutFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.List;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 19:16
 */
public class DwmJumpDetailApp extends BaseAppV1 {

    public static void main(String[] args) {
        new DwmJumpDetailApp().init(
                3002,
                1,
                "DwmJumpDetailApp",
                "DwmJumpDetailApp",
                Constant.TOPIC_DWD_PAGE     //消费page主题中的数据
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, DataStreamSource<String> sourceStream) {
        sourceStream =
                environment.fromElements(
                        "{\"common\":{\"mid\":\"101\"},\"page\":{\"page_id\":\"home\"},\"ts\":10000} ",
                        "{\"common\":{\"mid\":\"101\"},\"page\":{\"page_id\":\"home\"},\"ts\":11000} ",
                        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"home\"},\"ts\":10000}",
                        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"good_list\",\"last_page_id\":" +
                                "\"home\"},\"ts\":17000} ",
                        "{\"common\":{\"mid\":\"102\"},\"page\":{\"page_id\":\"good_list\",\"last_page_id\":" +
                                "\"detail\"},\"ts\":50000} "
                );

        KeyedStream<JSONObject, String> stream = sourceStream.map(JSON::parseObject) //将数据转为JSON格式
                .assignTimestampsAndWatermarks( //添加水印
                        WatermarkStrategy
                                .<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((log, ts) -> log.getLong("ts"))
                )
                .keyBy(obj -> obj.getJSONObject("common").getString("mid"));//按照mid分组

        //定义一个入口,后面紧跟一个页面
        Pattern<JSONObject, JSONObject> pattern = Pattern
                .<JSONObject>begin("entry")
                .where(new SimpleCondition<JSONObject>() { //入口的条件是用户没有访问其他页面
                    @Override
                    public boolean filter(JSONObject value) throws Exception {
                        String lastPageId = value.getJSONObject("page").getString("last_page_id");  //获取上一个页面的id
                        return lastPageId == null || lastPageId.isEmpty();
                    }
                })
                .next("nextPage")
                //让那些进来之后pageId和last_page_id不为空的数据留下(即该条数据进入窗口后,从上一个页面跳到了下一个页面)
                //如果这条数据超时,说明它没有跳到其它页面,而这正是我们想要的结果)
                .where(new SimpleCondition<JSONObject>() {
                    @Override
                    public boolean filter(JSONObject value) throws Exception {
                        JSONObject page = value.getJSONObject("page");
                        String page_id = page.getString("page_id");
                        String last_page_id = page.getString("last_page_id");
                        return page_id != null && last_page_id != null && !last_page_id.isEmpty();
                    }
                })
                .within(Time.seconds(5));   // the pattern must complete within 5 seconds (this is the CEP timeout, not a window)

        //模式应用到流上
        PatternStream<JSONObject> ps = CEP.pattern(stream, pattern);

        //取出满足模式数据(或者超时数据)
        SingleOutputStreamOperator<JSONObject> normal = ps.select(
                new OutputTag<JSONObject>("timeout") {  //侧输出流,存放超时数据
                },
                new PatternTimeoutFunction<JSONObject, JSONObject>() {
                    @Override
                    public JSONObject timeout(Map<String, List<JSONObject>> pattern,
                                              long timeoutTimestamp) throws Exception {
                        //超时数据, 就是跳出明细
                        return pattern.get("entry").get(0); //从entry窗口中获取JSON数据
                    }
                },
                new PatternSelectFunction<JSONObject, JSONObject>() {
                    @Override
                    public JSONObject select(Map<String, List<JSONObject>> map) throws Exception {
                        return null;    //满足正常访问的数据, 不用返回
                    }
                }
        );

        normal.getSideOutput(new OutputTag<JSONObject>("timeout"){}).print("jump");
    }
}

  7) Upgraded version

    (1) Add a constant to Constant

public static final String TOPIC_DWM_USER_JUMP_DETAIL = "dwm_user_jump_detail";

    (2)DwmJumpDetailApp_Two

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONAware;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV1;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.FlinkSinkUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.PatternTimeoutFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.List;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 19:51
 * 一个用户跳出有什么特质
 *     1. 当前页面应该是入口
 *         上一个页面是空
 *     2. 过了一段时间, 没有再访问其他页面
 *         超时时间
 */
public class DwmJumpDetailApp_Two extends BaseAppV1 {
    public static void main(String[] args) {
        new DwmJumpDetailApp_Two().init(
                3002,
                1,
                "DwmJumpDetailApp_Two",
                "DwmJumpDetailApp_Two",
                Constant.TOPIC_DWD_PAGE     //消费page主题中的数据
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, DataStreamSource<String> sourceStream) {
        KeyedStream<JSONObject, String> stream = sourceStream.map(JSON::parseObject) //将数据转为JSON格式
                .assignTimestampsAndWatermarks( //添加水印
                        WatermarkStrategy
                                .<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((log, ts) -> log.getLong("ts"))
                )
                .keyBy(obj -> obj.getJSONObject("common").getString("mid"));//按照mid分组

        //定义一个入口,后面紧跟一个页面
        Pattern<JSONObject, JSONObject> pattern = Pattern
                .<JSONObject>begin("entry")
                //入口的条件
                .where(new SimpleCondition<JSONObject>() { //入口的条件是用户没有访问其他页面
                    @Override
                    public boolean filter(JSONObject value) throws Exception {
                        String lastPageId = value.getJSONObject("page").getString("last_page_id");  //获取上一个页面的id
                        return lastPageId == null || lastPageId.isEmpty();
                    }
                })
                .next("nextPage")
                // the next event is itself another entry page (no last_page_id): the previous visit ended without further pages, so the earlier entry also counts as a bounce
                .where(new SimpleCondition<JSONObject>() {
                    @Override
                    public boolean filter(JSONObject value) throws Exception {
                        String lastPageId = value.getJSONObject("page").getString("last_page_id");  //获取上一个页面的id
                        return lastPageId == null || lastPageId.isEmpty();
                    }
                })
                .within(Time.seconds(5));   // the pattern must complete within 5 seconds (this is the CEP timeout, not a window)

        //模式应用到流上
        PatternStream<JSONObject> ps = CEP.pattern(stream, pattern);

        //取出满足模式数据(或者超时数据)
        SingleOutputStreamOperator<JSONObject> normal = ps.select(
                new OutputTag<JSONObject>("timeout") {  //侧输出流,存放超时数据
                },
                new PatternTimeoutFunction<JSONObject, JSONObject>() {
                    @Override
                    public JSONObject timeout(Map<String, List<JSONObject>> pattern,
                                              long timeoutTimestamp) throws Exception {
                        return pattern.get("entry").get(0); //从entry窗口中获取JSON数据
                    }
                },
                new PatternSelectFunction<JSONObject, JSONObject>() {
                    @Override
                    public JSONObject select(Map<String, List<JSONObject>> pattern) throws Exception {
                        return pattern.get("entry").get(0); //从entry窗口中获取JSON数据
                    }
                }
        );

        normal.union(normal.getSideOutput(new OutputTag<JSONObject>("timeout"){}))  //将超时的数据和主数据流中的数据union
                .map(JSONAware::toJSONString)
                .addSink(FlinkSinkUtil.getKafkaSink(Constant.TOPIC_DWM_USER_JUMP_DETAIL));
    }
}

3.1.4 Testing

  1) Package the program and upload it to /opt/module/applog

  2) Stop the yarn-session

  3) Modify the realtime.sh script

vim /home/atguigu/bin/realtime.sh
#!/bin/bash
flink=/opt/module/flink-yarn/bin/flink
jar=/opt/module/applog/flink-realtime-1.0-SNAPSHOT.jar

apps=(
        #com.yuange.flinkrealtime.app.dwd.DwdLogApp
        #com.yuange.flinkrealtime.app.dwd.DwdDbApp
        #com.yuange.flinkrealtime.app.dwm.DwmUvApp
        com.yuange.flinkrealtime.app.dwm.DwmJumpDetailApp_Two
)

for app in ${apps[*]} ; do
        $flink run -d -c $app $jar
done

  4) Start the yarn-session

/opt/module/flink-yarn/bin/yarn-session.sh -d

  5) Run the realtime.sh script

realtime.sh

  6) Start a consumer to read the dwm_user_jump_detail topic

consume dwm_user_jump_detail

  7) Generate simulated log data

cd /opt/software/mock/mock_log
java -jar gmall2020-mock-log-2020-12-18.jar

  8) Check what the consumer receives

Chapter 4 DWM Layer: Order Wide Table

4.1 Requirements Analysis and Approach

  Orders are a key object of statistical analysis, and there are many dimensional statistics around them, such as user, region, product, category, brand and so on. To make later statistics easier to compute and to reduce joins between large tables, the data surrounding an order is consolidated into a single order wide table during real-time processing.

  So which data should be consolidated with the order?

  As the diagram above shows, earlier steps already split the data into fact data and dimension data: fact data (green) flows into Kafka (the DWD layer) and dimension data (blue) goes into HBase for long-term storage. In the DWM layer the fact data and the dimension data are joined together to form the wide table, which means handling two kinds of joins: fact data with fact data, and fact data with dimension data.

  1) Joining fact data with fact data is a stream-to-stream join.

  2) Joining fact data with dimension data is querying an external data source from within the stream computation.

4.2 Joining Orders with Order Details

4.2.1 POJO Classes Used

  1) Order table POJO

package com.yuange.flinkrealtime.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;

/**
 * @作者:袁哥
 * @时间:2021/8/2 20:58
 */
@AllArgsConstructor
@NoArgsConstructor
@Data
public class OrderInfo {
    private Long id;
    private Long province_id;
    private String order_status;
    private Long user_id;
    private BigDecimal total_amount;
    private BigDecimal activity_reduce_amount;
    private BigDecimal coupon_reduce_amount;
    private BigDecimal original_total_amount;
    private BigDecimal feight_fee;
    private String expire_time;
    private String create_time;
    private String operate_time;
    private String create_date; // 把其他字段处理得到
    private String create_hour;
    private Long create_ts;

    // 为了create_ts时间戳赋值, 所以需要手动补充
    public void setCreate_time(String create_time) throws ParseException {
        this.create_time = create_time;

        this.create_date = this.create_time.substring(0, 10); // 年月日
        this.create_hour = this.create_time.substring(11, 13); // 小时

        final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        this.create_ts = sdf.parse(create_time).getTime();
    }
}

  2) Order detail table POJO

package com.yuange.flinkrealtime.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:01
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class OrderDetail {
    private Long id;
    private Long order_id;
    private Long sku_id;
    private BigDecimal order_price;
    private Long sku_num;
    private String sku_name;
    private String create_time;
    private BigDecimal split_total_amount;
    private BigDecimal split_activity_amount;
    private BigDecimal split_coupon_amount;
    private Long create_ts;
    // 为了create_ts时间戳赋值, 所以需要手动补充
    public void setCreate_time(String create_time) throws ParseException {
        this.create_time = create_time;
        final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        this.create_ts = sdf.parse(create_time).getTime();

    }
}

  3) Wide-table POJO after the join

package com.yuange.flinkrealtime.bean;

import com.alibaba.fastjson.JSON;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;

import static java.lang.Integer.parseInt;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:12
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class OrderWide {
    private Long detail_id;
    private Long order_id;
    private Long sku_id;
    private BigDecimal order_price;
    private Long sku_num;
    private String sku_name;
    private Long province_id;
    private String order_status;
    private Long user_id;

    private BigDecimal total_amount;
    private BigDecimal activity_reduce_amount;
    private BigDecimal coupon_reduce_amount;
    private BigDecimal original_total_amount;
    private BigDecimal feight_fee;
    private BigDecimal split_feight_fee;
    private BigDecimal split_activity_amount;
    private BigDecimal split_coupon_amount;
    private BigDecimal split_total_amount;

    private String expire_time;
    private String create_time;
    private String operate_time;
    private String create_date; // 把其他字段处理得到
    private String create_hour;

    private String province_name;//查询维表得到
    private String province_area_code;
    private String province_iso_code;
    private String province_3166_2_code;

    private Integer user_age;
    private String user_gender;

    private Long spu_id;     //作为维度数据 要关联进来
    private Long tm_id;
    private Long category3_id;
    private String spu_name;
    private String tm_name;
    private String category3_name;

    public OrderWide(OrderInfo orderInfo, OrderDetail orderDetail) {
        mergeOrderInfo(orderInfo);
        mergeOrderDetail(orderDetail);

    }

    public void mergeOrderInfo(OrderInfo orderInfo) {
        if (orderInfo != null) {
            this.order_id = orderInfo.getId();
            this.order_status = orderInfo.getOrder_status();
            this.create_time = orderInfo.getCreate_time();
            this.create_date = orderInfo.getCreate_date();
            this.create_hour = orderInfo.getCreate_hour();
            this.activity_reduce_amount = orderInfo.getActivity_reduce_amount();
            this.coupon_reduce_amount = orderInfo.getCoupon_reduce_amount();
            this.original_total_amount = orderInfo.getOriginal_total_amount();
            this.feight_fee = orderInfo.getFeight_fee();
            this.total_amount = orderInfo.getTotal_amount();
            this.province_id = orderInfo.getProvince_id();
            this.user_id = orderInfo.getUser_id();
        }
    }

    public void mergeOrderDetail(OrderDetail orderDetail) {
        if (orderDetail != null) {
            this.detail_id = orderDetail.getId();
            this.sku_id = orderDetail.getSku_id();
            this.sku_name = orderDetail.getSku_name();
            this.order_price = orderDetail.getOrder_price();
            this.sku_num = orderDetail.getSku_num();
            this.split_activity_amount = orderDetail.getSplit_activity_amount();
            this.split_coupon_amount = orderDetail.getSplit_coupon_amount();
            this.split_total_amount = orderDetail.getSplit_total_amount();
        }
    }

    public void setUser_age(String birthday){
        try {
            this.user_age = parseInt(birthday);
        } catch (Exception e) {
            try {
                long bir = new SimpleDateFormat("yyyy-MM-dd").parse(birthday).getTime();
                this.user_age = Math.toIntExact((System.currentTimeMillis() - bir) / 1000 / 60 / 60 / 24 / 365);
            } catch (ParseException e1) {
                e1.printStackTrace();
            }
        }
    }

    public String toJsonString(){
        return JSON.toJSONString(this);
    }
}

4.2.2 Join Code

  1) Upgrade BaseApp to BaseAppV2

    The BaseApp wrapped earlier can only consume a single topic and produce a single stream. This time we need to consume several topics and get several streams, so BaseApp is refactored: the init method and the abstract run method are reworked.

package com.yuange.flinkrealtime.app;

import com.yuange.flinkrealtime.util.FlinkSourceUtil;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:13
 */
public abstract class BaseAppV2 {

    public void init(int port, int p, String ck, String groupId, String topic,String... otherTopics){
        System.setProperty("HADOOP_USER_NAME","atguigu");
        Configuration configuration = new Configuration();
        configuration.setInteger("rest.port",port);
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment(configuration).setParallelism(p);

        environment.enableCheckpointing(5000);  //检查点之间的时间间隔,单位是毫秒
        environment.setStateBackend(new HashMapStateBackend()); //定义状态后端,以保证将检查点状态写入远程(HDFS)
        environment.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop162:8020/flinkparent/ck/" + ck);   //配置检查点存放地址

        environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); //设置检查点模式:精准一次
        environment.getCheckpointConfig().setMaxConcurrentCheckpoints(1);   // allow at most one checkpoint in flight at a time
        environment.getCheckpointConfig()
                .enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);  //设置检查点持久化:取消作业时保留外部化检查点

        HashSet<String> topics = new HashSet<>(Arrays.asList(otherTopics));
        topics.add(topic);

        Map<String, DataStreamSource<String>> streams = new HashMap<>();
        for (String t : topics) {

            DataStreamSource<String> stream = environment.addSource(FlinkSourceUtil.getKafkaSource(groupId, t));
            streams.put(t, stream);
        }

        run(environment,streams);

        try {
            environment.execute(ck);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    protected abstract void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams);
}

  2) Add constants to Constant

public static final String TOPIC_DWD_ORDER_INFO = "dwd_order_info";
public static final String TOPIC_DWD_ORDER_DETAIL = "dwd_order_detail";

  3) The join code

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV2;
import com.yuange.flinkrealtime.bean.OrderDetail;
import com.yuange.flinkrealtime.bean.OrderInfo;
import com.yuange.flinkrealtime.bean.OrderWide;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.DimUtil;
import com.yuange.flinkrealtime.util.JdbcUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.sql.Connection;
import java.time.Duration;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:18
 */
public class DwmOrderWideApp extends BaseAppV2 {

    public static void main(String[] args) {
        new DwmOrderWideApp().init(
                3003,
                1,
                "DwmOrderWideApp",
                "DwmOrderWideApp",
                Constant.TOPIC_DWD_ORDER_INFO, Constant.TOPIC_DWD_ORDER_DETAIL
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams) {
        // 1. 事实表进行join
        SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator = factJoin(streams);
    }

    private SingleOutputStreamOperator<OrderWide> factJoin(Map<String, DataStreamSource<String>> streams) {
        KeyedStream<OrderInfo, Long> orderInfoStream = streams.get(Constant.TOPIC_DWD_ORDER_INFO)
                .map(info -> JSON.parseObject(info, OrderInfo.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((info, ts) -> info.getCreate_ts())
                )
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailStream = streams.get(Constant.TOPIC_DWD_ORDER_DETAIL)
                .map(info -> JSON.parseObject(info, OrderDetail.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((detail, ts) -> detail.getCreate_ts())
                )
                .keyBy(OrderDetail::getOrder_id);

        return orderInfoStream.intervalJoin(orderDetailStream)
                .between(Time.seconds(-5),Time.seconds(5))
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo left,
                                               OrderDetail right,
                                               Context ctx,
                                               Collector<OrderWide> out) throws Exception {
                        out.collect(new OrderWide(left, right));
                    }
                });
    }
}

4.3 Joining Dimension Tables

  Joining dimensions really means querying, from within the stream, the dimension tables stored in HBase. Even when querying by primary key, an HBase lookup is not as fast as a join between streams. Queries against external data sources are often the performance bottleneck of stream computation, so we will add some optimizations on top of this basic approach.

4.3.1 Initializing the Dimension Data into HBase

  1) Start HBase

start-hbase.sh

  2) Start Maxwell

maxwell.sh start

  3) Start the yarn-session

/opt/module/flink-yarn/bin/yarn-session.sh -d

  4) Run the realtime.sh script to start DwdDbApp

  5) Use Maxwell's bootstrap to import the dimension data. Six dimension tables are involved: user_info, base_province, sku_info, spu_info, base_category3, base_trademark

cd /opt/module/maxwell-1.27.1/
bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table user_info --client_id maxwell_1

bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table base_province --client_id maxwell_1

bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table sku_info --client_id maxwell_1

bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table spu_info --client_id maxwell_1

bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table base_category3 --client_id maxwell_1

bin/maxwell-bootstrap --user maxwell  --password aaaaaa --host hadoop162  --database flinkdb --table base_trademark --client_id maxwell_1

  6) Check whether the data was imported

/opt/module/phoenix-5.0.0/bin/sqlline.py

4.3.2 Wrapping JdbcUtil

  1) Add a queryList() method to JdbcUtil

public static <T> List<T> queryList(Connection conn, String sql, Object[] args, Class<T> tClass) throws SQLException, IllegalAccessException, InstantiationException, InvocationTargetException {
        List<T> result = new ArrayList<>();
        // 通过一个sql查询数据, 应该得到多行, 每行封装到一个T类型的对象中
        PreparedStatement ps = conn.prepareStatement(sql);

        // 1. 给ps的占位符进行赋值
        for (int i = 0; args != null && i < args.length; i++) {
            ps.setObject(i + 1, args[i]);
        }

        // 2. 执行sql, 得到结果集      id  name  age
        ResultSet resultSet = ps.executeQuery();
        ResultSetMetaData metaData = resultSet.getMetaData();
        while (resultSet.next()) {
            T t = tClass.newInstance(); // 使用无参构造函数进行创建对象   new User()

            for (int i = 1; i <= metaData.getColumnCount(); i++) { // 列的索引从1开始
                String columnName = metaData.getColumnLabel(i);  // 获取列名的别名(如果有)
                Object value = resultSet.getObject(columnName);
                BeanUtils.setProperty(t, columnName, value);
            }
            result.add(t);
        }

        return result;
    }

  2) Create DimUtil

package com.yuange.flinkrealtime.util;

import com.alibaba.fastjson.JSONObject;

import java.lang.reflect.InvocationTargetException;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:30
 */
public class DimUtil {
    public static JSONObject readDimFromPhoenix(Connection phoenixConn, String tableName, String id) throws SQLException, IllegalAccessException, InvocationTargetException, InstantiationException {
        // 通过jdbc去Phoenix查询数据
        String sql = "select * from " + tableName + " where id = ?";

        List<JSONObject> list = JdbcUtil.<JSONObject>queryList(phoenixConn, sql, new Object[]{id}, JSONObject.class);

        return list.size() == 0 ? new JSONObject() : list.get(0);
    }
}
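
  DimUtil (and the sinks later in this chapter) call JdbcUtil.getPhoenixConnection(), which was added to JdbcUtil in an earlier chapter and is not repeated here. For reference, a minimal sketch of such a method, assuming the HBase/Phoenix Zookeeper runs on hadoop162 as elsewhere in this setup:

// Sketch only; the real JdbcUtil.getPhoenixConnection() from the earlier chapter may differ.
public static Connection getPhoenixConnection() throws ClassNotFoundException, SQLException {
    // Phoenix thick-client JDBC URL: jdbc:phoenix:<zookeeper quorum>:<port>
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
    return DriverManager.getConnection("jdbc:phoenix:hadoop162:2181");
}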

  3) Add constants to Constant

public static final String DIM_USER_INFO = "DIM_USER_INFO";
public static final String DIM_BASE_PROVINCE = "DIM_BASE_PROVINCE";
public static final String DIM_SKU_INFO = "DIM_SKU_INFO";
public static final String DIM_SPU_INFO = "DIM_SPU_INFO";
public static final String DIM_BASE_TRADEMARK = "DIM_BASE_TRADEMARK";
public static final String DIM_BASE_CATEGORY3 = "DIM_BASE_CATEGORY3";

4.3.3 Dimension Join Code

  1) The code is as follows

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV2;
import com.yuange.flinkrealtime.bean.OrderDetail;
import com.yuange.flinkrealtime.bean.OrderInfo;
import com.yuange.flinkrealtime.bean.OrderWide;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.DimUtil;
import com.yuange.flinkrealtime.util.JdbcUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.sql.Connection;
import java.time.Duration;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:18
 */
public class DwmOrderWideApp extends BaseAppV2 {

    public static void main(String[] args) {
        new DwmOrderWideApp().init(
                3003,
                1,
                "DwmOrderWideApp",
                "DwmOrderWideApp",
                Constant.TOPIC_DWD_ORDER_INFO, Constant.TOPIC_DWD_ORDER_DETAIL
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams) {
        // 1. 事实表进行join
        SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator = factJoin(streams);
        // 2. join的维度数据
        SingleOutputStreamOperator<OrderWide> orderWideStreamWithDim = dimJoin(orderWideSingleOutputStreamOperator);
        orderWideStreamWithDim.print();
        // 3. 把宽表写入到dwm层(kafka)
    }

    private SingleOutputStreamOperator<OrderWide> dimJoin(SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator) {
        /*join维度:
        每一个OrderWide都去hbase里查询相应的维度*/
        return orderWideSingleOutputStreamOperator.map(new RichMapFunction<OrderWide, OrderWide>() {
            private Connection phoenixConn;

            @Override
            public void open(Configuration parameters) throws Exception {
                phoenixConn = JdbcUtil.getPhoenixConnection();
            }

            @Override
            public OrderWide map(OrderWide wide) throws Exception {
                // 补充 dim_user_info  select * from t where id=?
                JSONObject userInfo = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_USER_INFO, wide.getUser_id().toString());
                wide.setUser_gender(userInfo.getString("GENDER"));
                wide.setUser_age(userInfo.getString("BIRTHDAY"));

                // 2. 省份
                JSONObject baseProvince = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_BASE_PROVINCE, wide.getProvince_id().toString());
                wide.setProvince_3166_2_code(baseProvince.getString("ISO_3166_2"));
                wide.setProvince_area_code(baseProvince.getString("AREA_CODE"));
                wide.setProvince_iso_code(baseProvince.getString("ISO_CODE"));
                wide.setProvince_name(baseProvince.getString("NAME"));

                // 3. sku
                JSONObject skuInfo = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_SKU_INFO, wide.getSku_id().toString());
                wide.setSku_name(skuInfo.getString("SKU_NAME"));

                wide.setSpu_id(skuInfo.getLong("SPU_ID"));
                wide.setTm_id(skuInfo.getLong("TM_ID"));
                wide.setCategory3_id(skuInfo.getLong("CATEGORY3_ID"));

                // 4. spu
                JSONObject spuInfo = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_SPU_INFO, wide.getSpu_id().toString());
                wide.setSpu_name(spuInfo.getString("SPU_NAME"));
                // 5. tm
                JSONObject tmInfo = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_BASE_TRADEMARK, wide.getTm_id().toString());
                wide.setTm_name(tmInfo.getString("TM_NAME"));

                // 5. c3
                JSONObject c3Info = DimUtil.readDimFromPhoenix(phoenixConn, Constant.DIM_BASE_CATEGORY3, wide.getCategory3_id().toString());
                wide.setCategory3_name(c3Info.getString("NAME"));
                return wide;
            }
        });
    }

    private SingleOutputStreamOperator<OrderWide> factJoin(Map<String, DataStreamSource<String>> streams) {
        KeyedStream<OrderInfo, Long> orderInfoStream = streams.get(Constant.TOPIC_DWD_ORDER_INFO)
                .map(info -> JSON.parseObject(info, OrderInfo.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((info, ts) -> info.getCreate_ts())
                )
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailStream = streams.get(Constant.TOPIC_DWD_ORDER_DETAIL)
                .map(info -> JSON.parseObject(info, OrderDetail.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((detail, ts) -> detail.getCreate_ts())
                )
                .keyBy(OrderDetail::getOrder_id);

        return orderInfoStream.intervalJoin(orderDetailStream)
                .between(Time.seconds(-5),Time.seconds(5))
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo left,
                                               OrderDetail right,
                                               Context ctx,
                                               Collector<OrderWide> out) throws Exception {
                        out.collect(new OrderWide(left, right));
                    }
                });
    }
}

  2) Start the program in IDEA, then generate business data and check whether the console prints results

cd /opt/software/mock/mock_db
java -jar gmall2020-mock-db-2020-12-23.jar

4.3.4 Optimizing Dimension Lookup Performance

  Optimization 1: add a cache-aside pattern

    In the implementation above we query HBase directly. Queries against external data sources are often the bottleneck of stream processing, so we optimize on top of that implementation using a cache-aside pattern, a very common on-demand caching pattern. As shown in the figure, every request checks the cache first; on a hit the cached data is returned immediately, and on a miss the database is queried and the result is written to the cache for later requests.

    This caching strategy has a few points to watch:

      1) The cache must have an expiration time, otherwise cold data stays resident and wastes resources.

      2) Consider whether the dimension data can change; if it can, the cache must be actively updated.

    Cache options: an on-heap cache (the application's own memory) or a standalone cache service (Redis, Memcached).

      An on-heap cache performs better, since the access path is shorter and there is less overhead, but it is harder to manage: other processes cannot maintain the cached data, and it consumes application memory.

      A standalone cache service (Redis, Memcached) also performs well, at the cost of connection setup and network I/O. If the data can change, though, a standalone cache is easier to manage, and it scales more easily when the data volume is very large.

      Because our dimension data is mutable, we use Redis to manage the cache.

    Implementation code

      1) Add the Jedis dependency

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>3.2.0</version>
</dependency>

      2) Wrap a Jedis utility class

package com.yuange.flinkrealtime.util;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

/**
 * @作者:袁哥
 * @时间:2021/8/3 8:35
 */
public class RedisUtil {
    static JedisPool pool;

    static {
        JedisPoolConfig conf = new JedisPoolConfig();
        conf.setMaxTotal(300);  // maximum number of connections in the pool
        conf.setMaxIdle(10);
        conf.setMaxWaitMillis(10000);   //等待时间
        conf.setMinIdle(4);
        conf.setTestOnCreate(true);
        conf.setTestOnBorrow(true);
        conf.setTestOnReturn(true);

        pool = new JedisPool(conf,"hadoop162",6379);
    }

    public static Jedis getRedisClient(){
        Jedis resource = pool.getResource();
        resource.select(1); //选择1号库
        return resource;
    }
}

    Choosing the Redis data structure:

      Which structure should we use in Redis? It should be convenient to read and write, and each record must be able to get its own expiration time; without expiration, cold data would keep consuming memory. All things considered, we use a plain string:

| key | value |
| --- | --- |
| "dim_" + table + "_" + id (the getRedisDimKey method below actually builds table + ":" + id) | JSON string of the dimension record |

    Join code with cached dimensions. For convenience, the earlier readDim code needs to be refactored.

      (1) Add the following to DimUtil:

public static JSONObject readDim(Connection phoenixConn, Jedis client, String tableName, String id) throws InvocationTargetException, SQLException, InstantiationException, IllegalAccessException {
    // read from redis first
    JSONObject jsonObject = readDimFromRedis(client, tableName, id);
    if (jsonObject != null) {
        return jsonObject;
    } else {
        // cache miss: read from Phoenix (HBase), then write the result into redis
        jsonObject = readDimFromPhoenix(phoenixConn, tableName, id);
        writeDimToRedis(client, tableName, id, jsonObject);
        return jsonObject;
    }
}

private static void writeDimToRedis(Jedis client, String tableName, String id, JSONObject jsonObject) {
    String key = getRedisDimKey(tableName, id);
    String value = jsonObject.toJSONString();
    client.setex(key, 24 * 60 * 60, value);
}

private static JSONObject readDimFromRedis(Jedis client, String tableName, String id) {
    String key = getRedisDimKey(tableName, id);  // build the key
    String value = client.get(key);
    if (value != null) {
        // after each hit, reset the expiration to 24 hours so hot data does not expire
        client.expire(key, 24 * 60 * 60);
        return JSON.parseObject(value);
    }
    return null;
}

private static String getRedisDimKey(String tableName, String id) {
    return tableName + ":" + id;
}

      (2) Add a Redis update to PhoenixSink: when the dimension data in HBase changes, Redis must be updated as well

package com.yuange.flinkrealtime.sink;

import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.bean.TableProcess;
import com.yuange.flinkrealtime.util.JdbcUtil;
import com.yuange.flinkrealtime.util.RedisUtil;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/7/30 23:25
 */
public class PhoenixSink extends RichSinkFunction<Tuple2<JSONObject, TableProcess>> {
    Connection conn;
    ValueState<String> tableCreateState;
    private Jedis client;
    @Override
    public void open(Configuration parameters) throws Exception {
        //先加载驱动, 很多情况不是必须.
        //大部分常用的数据库会根据url自动选择合适的driver
        //Phoenix 驱动有些时候需要手动加载一下
        conn = JdbcUtil.getPhoenixConnection();
        //创建一个状态来管理table
        tableCreateState = getRuntimeContext().getState(new ValueStateDescriptor<String>("tableCreateState", String.class));
        client = RedisUtil.getRedisClient();
        client.select(1);
    }

    @Override
    public void invoke(Tuple2<JSONObject, TableProcess> value, Context context) throws Exception {
        // 1. 检测表, 如果表不存在就需要在Phoenix中新建表
        checkTable(value);
        // 2. 再把数据写入到phoenix中
        writeToPhoenix(value);
        // 3. 更新redis缓存  (读取维度数据缓存优化的时候, 添加这个代码)
        // 对缓存中已经存在的, 并且这次又更新的维度, 去更新缓冲中的维度. 新增的维度千万不要去写入到缓存
        // 粗暴: 直接把缓存删除
        updateCache(value);
    }

    private void updateCache(Tuple2<JSONObject, TableProcess> value) {
        // {"id": 1} => {"ID": 1}

        // 把json数据的key全部大写
        JSONObject data = new JSONObject();
        for (Map.Entry<String, Object> entry : value.f0.entrySet()) {
            data.put(entry.getKey().toUpperCase(), entry.getValue());
        }

        // 更新redis缓存
        // 这次是update, 并且redis中还存在
        String operateType = value.f1.getOperate_type();
        String key = value.f1.getSink_table().toUpperCase() + ":" + data.get("ID");

        String dim = client.get(key);
        if ("update".equals(operateType) && dim != null) {
            // 更新
            client.setex(key, 24 * 60 * 60, data.toJSONString());
        }
    }

    private void writeToPhoenix(Tuple2<JSONObject, TableProcess> value) throws SQLException {
        JSONObject data = value.f0;
        TableProcess tp = value.f1;

        // upsert  into user(id, name, age) values(?,?,?)
        //拼接SQL语句
        StringBuilder insertSql = new StringBuilder();
        insertSql
                .append("upsert into ")
                .append(tp.getSink_table())
                .append("(")
                //id,activity_name,activity_type,activity_desc,start_time,end_time,create_time
                .append(tp.getSink_columns())
                .append(")values(")
                //把非,部分替换为?
                .append(tp.getSink_columns().replaceAll("[^,]+","?"))
                .append(")");
        PreparedStatement ps = conn.prepareStatement(insertSql.toString());
        //给ps中的占位符赋值
        String[] columnNames = tp.getSink_columns().split(",");
        for (int i = 0; i < columnNames.length; i++) {
            //从JSONObject数据中取出对应字段的值
            Object str = data.getString(columnNames[i]);
            ps.setString(i + 1,str == null ? "" : str.toString());
        }

        ps.execute();
        conn.commit();
        ps.close();
    }

    private void checkTable(Tuple2<JSONObject, TableProcess> value) throws IOException, SQLException {
        if (tableCreateState.value() == null){
            // 执行sql语句   create table if not exists user(id varchar, age varchar )
            TableProcess tp = value.f1;
            // 拼接sql语句
            StringBuilder createSql = new StringBuilder();
            createSql
                    .append("create table if not exists ")
                    .append(tp.getSink_table())
                    .append("(")
                    .append(tp.getSink_columns().replaceAll(","," varchar,"))
                    .append(" varchar, constraint pk primary key(")
                    .append(tp.getSink_pk() == null ? "id" : tp.getSink_pk())
                    .append("))")
                    .append(tp.getSink_extend() == null ? "" : tp.getSink_extend());

            PreparedStatement ps = conn.prepareStatement(createSql.toString());
            ps.execute();
            conn.commit();
            ps.close();
            //更新状态
            tableCreateState.update(tp.getSink_table());
        }
    }
}

      Handling the cache when dimension data changes: if the changed record is present in the cache, it must be deleted or updated immediately. Create the DwmOrderWideApp_Cache class.

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV2;
import com.yuange.flinkrealtime.bean.OrderDetail;
import com.yuange.flinkrealtime.bean.OrderInfo;
import com.yuange.flinkrealtime.bean.OrderWide;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.DimUtil;
import com.yuange.flinkrealtime.util.JdbcUtil;
import com.yuange.flinkrealtime.util.RedisUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import redis.clients.jedis.Jedis;

import java.sql.Connection;
import java.time.Duration;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:18
 */
public class DwmOrderWideApp_Cache extends BaseAppV2 {

    public static void main(String[] args) {
        new DwmOrderWideApp_Cache().init(
                3003,
                1,
                "DwmOrderWideApp",
                "DwmOrderWideApp",
                Constant.TOPIC_DWD_ORDER_INFO, Constant.TOPIC_DWD_ORDER_DETAIL
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams) {
        // 1. 事实表进行join
        SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator = factJoin(streams);
        // 2. join的维度数据
        SingleOutputStreamOperator<OrderWide> orderWideStreamWithDim = dimJoin(orderWideSingleOutputStreamOperator);
        orderWideStreamWithDim.print();
        // 3. 把宽表写入到dwm层(kafka)
    }

    private SingleOutputStreamOperator<OrderWide> dimJoin(SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator) {
        /*join维度:
        每一个OrderWide都去hbase里查询相应的维度*/
        return orderWideSingleOutputStreamOperator.map(new RichMapFunction<OrderWide, OrderWide>() {
            private Connection phoenixConn;
            Jedis redisClient;
            @Override
            public void open(Configuration parameters) throws Exception {
                phoenixConn = JdbcUtil.getPhoenixConnection();
                redisClient = RedisUtil.getRedisClient();
            }

            @Override
            public OrderWide map(OrderWide wide) throws Exception {
                // 补充 dim_user_info  select * from t where id=?
                JSONObject userInfo = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_USER_INFO, wide.getUser_id().toString());
                wide.setUser_gender(userInfo.getString("GENDER"));
                wide.setUser_age(userInfo.getString("BIRTHDAY"));

                // 2. 省份
                JSONObject baseProvince = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_BASE_PROVINCE, wide.getProvince_id().toString());
                wide.setProvince_3166_2_code(baseProvince.getString("ISO_3166_2"));
                wide.setProvince_area_code(baseProvince.getString("AREA_CODE"));
                wide.setProvince_iso_code(baseProvince.getString("ISO_CODE"));
                wide.setProvince_name(baseProvince.getString("NAME"));

                // 3. sku
                JSONObject skuInfo = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_SKU_INFO, wide.getSku_id().toString());
                wide.setSku_name(skuInfo.getString("SKU_NAME"));

                wide.setSpu_id(skuInfo.getLong("SPU_ID"));
                wide.setTm_id(skuInfo.getLong("TM_ID"));
                wide.setCategory3_id(skuInfo.getLong("CATEGORY3_ID"));

                // 4. spu
                JSONObject spuInfo = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_SPU_INFO, wide.getSpu_id().toString());
                wide.setSpu_name(spuInfo.getString("SPU_NAME"));
                // 5. tm
                JSONObject tmInfo = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_BASE_TRADEMARK, wide.getTm_id().toString());
                wide.setTm_name(tmInfo.getString("TM_NAME"));

                // 5. c3
                JSONObject c3Info = DimUtil.readDim(phoenixConn,redisClient, Constant.DIM_BASE_CATEGORY3, wide.getCategory3_id().toString());
                wide.setCategory3_name(c3Info.getString("NAME"));
                return wide;
            }

            @Override
            public void close() throws Exception {
                if (redisClient != null){
                    redisClient.close();
                }
                if (phoenixConn != null){
                    phoenixConn.close();
                }
            }
        });

    }

    private SingleOutputStreamOperator<OrderWide> factJoin(Map<String, DataStreamSource<String>> streams) {
        KeyedStream<OrderInfo, Long> orderInfoStream = streams.get(Constant.TOPIC_DWD_ORDER_INFO)
                .map(info -> JSON.parseObject(info, OrderInfo.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((info, ts) -> info.getCreate_ts())
                )
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailStream = streams.get(Constant.TOPIC_DWD_ORDER_DETAIL)
                .map(info -> JSON.parseObject(info, OrderDetail.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((detail, ts) -> detail.getCreate_ts())
                )
                .keyBy(OrderDetail::getOrder_id);

        return orderInfoStream.intervalJoin(orderDetailStream)
                .between(Time.seconds(-5),Time.seconds(5))
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo left,
                                               OrderDetail right,
                                               Context ctx,
                                               Collector<OrderWide> out) throws Exception {
                        out.collect(new OrderWide(left, right));
                    }
                });
    }
}

    Testing:

      (1) Start Hadoop, Zookeeper, Kafka, Maxwell, HBase, and the Flink yarn-session

      (2) Write a Redis startup script and make it executable

vim /home/atguigu/bin/redis.sh
#!/bin/bash
echo "Starting the redis server on hadoop162"
/usr/local/bin/redis-server /opt/module/redis-3.2.12/redis.conf
chmod +x /home/atguigu/bin/redis.sh

      (3) Run the script to start Redis

redis.sh

      (4) Package with Maven and upload the jar to /opt/module/applog

      (5) Modify realtime.sh

vim /home/atguigu/bin/realtime.sh
#!/bin/bash
flink=/opt/module/flink-yarn/bin/flink
jar=/opt/module/applog/flink-realtime-1.0-SNAPSHOT.jar

apps=(
        #com.yuange.flinkrealtime.app.dwd.DwdLogApp
        com.yuange.flinkrealtime.app.dwd.DwdDbApp
        #com.yuange.flinkrealtime.app.dwm.DwmUvApp
        #com.yuange.flinkrealtime.app.dwm.DwmJumpDetailApp_Two
        com.yuange.flinkrealtime.app.dwm.DwmOrderWideApp_Cache
)

for app in ${apps[*]} ; do
        $flink run -d -c $app $jar
done

      (6) Run the realtime.sh script to submit the jobs to the yarn-session

realtime.sh

      (7) Generate simulated data and test

cd /opt/software/mock/mock_db
java -jar gmall2020-mock-db-2020-12-23.jar

      (8)启动redis客户端,查看1号库是否有数据

redis-cli --raw
#选择1号库
select 1

  Optimization 2: asynchronous queries

    In Flink stream processing we frequently need to interact with external systems, typically to complete the fields of a fact table from dimension tables. In e-commerce, for example, a product's skuid is used to look up attributes such as the category the product belongs to, its manufacturer and details about that manufacturer; in logistics, a parcel id is used to look up the parcel's category, shipping information, delivery information, and so on.

    By default, a single parallel subtask of a Flink MapFunction can only interact synchronously: it sends a request to the external store, blocks on I/O, waits for the response, and only then sends the next request. Most of the time is therefore spent waiting on the network. Raising the MapFunction's parallelism speeds things up, but more parallelism means more resources, so it is not a good solution on its own.

    Flink introduced Async I/O in version 1.2. In asynchronous mode the I/O operations no longer block: a single parallel subtask can have several requests in flight at once and handles whichever response arrives first, so consecutive requests do not wait for each other, which greatly improves streaming throughput.

    Async I/O was a highly requested feature contributed to the community by Alibaba; it addresses the case where network latency to external systems becomes the system bottleneck.

    An asynchronous query essentially hands the dimension lookups over to a separate thread pool, so a single slow lookup does not block the stream and one parallel subtask can issue many requests concurrently. This is particularly effective for operations involving network I/O, because it cuts the time wasted waiting for responses.

    Prerequisite for the async API: properly implementing asynchronous I/O against a database (or key/value store) requires a database client that supports asynchronous requests. Many mainstream databases provide one. If no such client is available, a synchronous client can be turned into a limited-concurrency client by creating several client instances and handling the synchronous calls with a thread pool, although this is usually less efficient than a genuine asynchronous client. Phoenix currently offers no asynchronous client, so that thread-pool approach is exactly what we use here.

    Flink's async I/O API: the API lets you use an asynchronous request client inside stream processing. It takes care of integrating with the data stream as well as ordering, event time and fault tolerance. Given an asynchronous database client, wiring the asynchronous database I/O into a stream transformation requires three parts (a minimal generic sketch follows this list):

      1) An AsyncFunction that dispatches the requests

      2) A callback that obtains the result of the database interaction and hands it to the ResultFuture

      3) Applying the async I/O operation to a DataStream as a transformation of that DataStream.
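
    Before moving on to the project code, here is a minimal, self-contained sketch of those three parts. It is only an illustration: the class name AsyncIoSketch and the "dim-of-..." lookup stub are made up for this example, and the database call is simulated with a CompletableFuture instead of a real asynchronous client.

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncIoSketch {

    // 1. The AsyncFunction that dispatches the requests
    public static class DimLookupFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String id, ResultFuture<String> resultFuture) throws Exception {
            // A real implementation would call an asynchronous database client here;
            // a CompletableFuture stands in for it so the sketch runs on its own
            CompletableFuture
                    .supplyAsync(() -> "dim-of-" + id)
                    // 2. The callback hands the result back to Flink through the ResultFuture
                    .thenAccept(dim -> resultFuture.complete(Collections.singletonList(id + " -> " + dim)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> ids = env.fromElements("1", "2", "3");

        // 3. Apply the async I/O operation to the DataStream as a transformation
        AsyncDataStream
                .unorderedWait(ids, new DimLookupFunction(), 60, TimeUnit.SECONDS)
                .print();

        env.execute("AsyncIoSketch");
    }
}

    unorderedWait emits results as soon as they return; AsyncDataStream.orderedWait would preserve the input order at the cost of extra latency. The project code below uses unorderedWait, presumably because strict ordering is not needed downstream.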

    Implementation code for the async API:

      1) Add a constant to Constant

public static final String TOPIC_DWM_ORDER_WIDE = "dwm_order_wide";

      2) Create ThreadUtil to provide a thread pool

package com.yuange.flinkrealtime.util;

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * @作者:袁哥
 * @时间:2021/8/3 16:44
 */
public class ThreadUtil {

    public static ThreadPoolExecutor getThreadPool(){
        return new ThreadPoolExecutor(
                100,    // core pool size
                300,    // maximum pool size
                1,      // keep-alive time for idle non-core threads
                TimeUnit.MINUTES,
                new LinkedBlockingQueue<>(100)  // tasks wait in this bounded queue when all core threads are busy
        );
    }
}
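
      Note on these parameters: a ThreadPoolExecutor first fills the 100 core threads, then queues further tasks in the bounded queue of 100; only when the queue is full does the pool grow toward the 300-thread maximum, and submissions beyond that are rejected. Idle non-core threads are reclaimed after the 1-minute keep-alive.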

      3) Create DimAsyncFunction, extending the asynchronous RichAsyncFunction class with a generic type parameter

package com.yuange.flinkrealtime.function;

import com.yuange.flinkrealtime.util.JdbcUtil;
import com.yuange.flinkrealtime.util.RedisUtil;
import com.yuange.flinkrealtime.util.ThreadUtil;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import redis.clients.jedis.Jedis;

import java.sql.Connection;
import java.util.concurrent.ThreadPoolExecutor;

/**
 * @作者:袁哥
 * @时间:2021/8/3 16:43
 */
public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T,T> {
    ThreadPoolExecutor threadPool;
    Connection phoenixConnection;
    @Override
    public void open(Configuration parameters) throws Exception {
        threadPool = ThreadUtil.getThreadPool();
        phoenixConnection = JdbcUtil.getPhoenixConnection();
    }

    @Override
    public void close() throws Exception {
        if (phoenixConnection != null) {
            phoenixConnection.close();
        }

        if (threadPool != null) {
            threadPool.shutdown();
        }
    }

    @Override
    public void timeout(T input, ResultFuture<T> resultFuture) throws Exception {
        // Called when an async operation exceeds the timeout
        // (the default implementation would fail the job, so we only log here)
        System.out.println("Timeout: " + input);
    }

    public abstract void addDim(Connection phoenixConn,
                                Jedis client,
                                T input,
                                ResultFuture<T> resultFuture) throws Exception;

    @Override
    public void asyncInvoke(T input,
                            ResultFuture<T> resultFuture) throws Exception {
        // The client needs an async API; Phoenix has none, so the lookup is delegated to a thread pool
        threadPool.submit(() -> {
            // Do the dimension lookups here.
            // When Jedis is used from multiple threads, every operation needs its own client instance
            Jedis client = RedisUtil.getRedisClient();
            try {
                // The business-specific lookups are implemented in addDim
                addDim(phoenixConnection, client, input, resultFuture);
            } catch (Exception e) {
                e.printStackTrace();
                throw new RuntimeException("Exception during the async lookup. " +
                        "Check that Phoenix, redis, hadoop and maxwell are all running.");
            } finally {
                // Always return the Jedis client, even when the lookup fails
                client.close();
            }
        });
    }
}

      4) Create DwmOrderWideApp_Cache_Async, which processes the data asynchronously and writes the result to Kafka

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.yuange.flinkrealtime.app.BaseAppV2;
import com.yuange.flinkrealtime.bean.OrderDetail;
import com.yuange.flinkrealtime.bean.OrderInfo;
import com.yuange.flinkrealtime.bean.OrderWide;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.function.DimAsyncFunction;
import com.yuange.flinkrealtime.util.DimUtil;
import com.yuange.flinkrealtime.util.FlinkSinkUtil;
import com.yuange.flinkrealtime.util.JdbcUtil;
import com.yuange.flinkrealtime.util.RedisUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import redis.clients.jedis.Jedis;

import java.sql.Connection;
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * @作者:袁哥
 * @时间:2021/8/2 21:18
 */
public class DwmOrderWideApp_Cache_Async extends BaseAppV2 {

    public static void main(String[] args) {
        new DwmOrderWideApp_Cache_Async().init(
                3003,
                1,
                "DwmOrderWideApp",
                "DwmOrderWideApp",
                Constant.TOPIC_DWD_ORDER_INFO, Constant.TOPIC_DWD_ORDER_DETAIL
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams) {
        // 1. join the fact tables
        SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator = factJoin(streams);
        // 2. join the dimension data
        SingleOutputStreamOperator<OrderWide> orderWideStreamWithDim = dimJoin(orderWideSingleOutputStreamOperator);
        orderWideStreamWithDim.print();
        // 3. write the wide table to the dwm layer (Kafka)
        sendToKafka(orderWideStreamWithDim);
    }

    private void sendToKafka(SingleOutputStreamOperator<OrderWide> stream) {
        stream.map(JSON::toJSONString)  // serialize each OrderWide to a JSON string
                .addSink(FlinkSinkUtil.getKafkaSink(Constant.TOPIC_DWM_ORDER_WIDE));
    }

    private SingleOutputStreamOperator<OrderWide> dimJoin(SingleOutputStreamOperator<OrderWide> orderWideSingleOutputStreamOperator) {
        return AsyncDataStream.unorderedWait(
                orderWideSingleOutputStreamOperator,    // the stream to enrich asynchronously
                new DimAsyncFunction<OrderWide>() {

                    @Override
                    public void addDim(Connection phoenixConn,
                                       Jedis redisClient,
                                       OrderWide wide,
                                       ResultFuture<OrderWide> resultFuture) throws Exception {
                        // 1. enrich from dim_user_info: select * from t where id = ?
                        JSONObject userInfo = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_USER_INFO, wide.getUser_id().toString());
                        wide.setUser_gender(userInfo.getString("GENDER"));
                        wide.setUser_age(userInfo.getString("BIRTHDAY"));

                        // 2. province
                        JSONObject baseProvince = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_BASE_PROVINCE, wide.getProvince_id().toString());
                        wide.setProvince_3166_2_code(baseProvince.getString("ISO_3166_2"));
                        wide.setProvince_area_code(baseProvince.getString("AREA_CODE"));
                        wide.setProvince_iso_code(baseProvince.getString("ISO_CODE"));
                        wide.setProvince_name(baseProvince.getString("NAME"));

                        // 3. sku
                        JSONObject skuInfo = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_SKU_INFO, wide.getSku_id().toString());
                        wide.setSku_name(skuInfo.getString("SKU_NAME"));

                        wide.setSpu_id(skuInfo.getLong("SPU_ID"));
                        wide.setTm_id(skuInfo.getLong("TM_ID"));
                        wide.setCategory3_id(skuInfo.getLong("CATEGORY3_ID"));

                        // 4. spu
                        JSONObject spuInfo = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_SPU_INFO, wide.getSpu_id().toString());
                        wide.setSpu_name(spuInfo.getString("SPU_NAME"));
                        // 5. trademark
                        JSONObject tmInfo = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_BASE_TRADEMARK, wide.getTm_id().toString());
                        wide.setTm_name(tmInfo.getString("TM_NAME"));

                        // 6. category3
                        JSONObject c3Info = DimUtil.readDim(phoenixConn, redisClient, Constant.DIM_BASE_CATEGORY3, wide.getCategory3_id().toString());
                        wide.setCategory3_name(c3Info.getString("NAME"));

                        resultFuture.complete(Collections.singletonList(wide));
                    }
                },  // the async lookup function
                60, // timeout for each async operation
                TimeUnit.SECONDS    // unit of the timeout
        );
    }

    private SingleOutputStreamOperator<OrderWide> factJoin(Map<String, DataStreamSource<String>> streams) {
        KeyedStream<OrderInfo, Long> orderInfoStream = streams.get(Constant.TOPIC_DWD_ORDER_INFO)
                .map(info -> JSON.parseObject(info, OrderInfo.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((info, ts) -> info.getCreate_ts())
                )
                .keyBy(OrderInfo::getId);

        KeyedStream<OrderDetail, Long> orderDetailStream = streams.get(Constant.TOPIC_DWD_ORDER_DETAIL)
                .map(info -> JSON.parseObject(info, OrderDetail.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((detail, ts) -> detail.getCreate_ts())
                )
                .keyBy(OrderDetail::getOrder_id);

        return orderInfoStream.intervalJoin(orderDetailStream)
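                // interval join: match order_detail records whose create_ts is within 5 seconds
                // of the order_info create_ts (the two records are written almost simultaneously)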
                .between(Time.seconds(-5),Time.seconds(5))
                .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo left,
                                               OrderDetail right,
                                               Context ctx,
                                               Collector<OrderWide> out) throws Exception {
                        out.collect(new OrderWide(left, right));
                    }
                });
    }
}

      5) Package flink-realtime and upload it to /opt/module/applog

      6) Edit the realtime.sh startup script

vim /home/atguigu/bin/realtime.sh
#!/bin/bash
flink=/opt/module/flink-yarn/bin/flink
jar=/opt/module/applog/flink-realtime-1.0-SNAPSHOT.jar

apps=(
        com.yuange.flinkrealtime.app.dwd.DwdLogApp
        com.yuange.flinkrealtime.app.dwd.DwdDbApp
        com.yuange.flinkrealtime.app.dwm.DwmUvApp
        #com.yuange.flinkrealtime.app.dwm.DwmJumpDetailApp_Two
        #com.yuange.flinkrealtime.app.dwm.DwmOrderWideApp_Cache
        com.yuange.flinkrealtime.app.dwm.DwmOrderWideApp_Cache_Async
)

for app in ${apps[*]} ; do
        $flink run -d -c $app $jar
done

      7) Start the yarn-session

/opt/module/flink-yarn/bin/yarn-session.sh -d

      8) Clear db 1 in redis (run flushdb in a redis-cli session after select 1)

flushdb

      9) Run the realtime.sh script to submit the jobs to the yarn-session

realtime.sh

      10) Start a consumer to check whether the data reaches Kafka

consume dwm_order_wide

      11) Generate business data to simulate new records

cd /opt/software/mock/mock_db
java -jar gmall2020-mock-db-2020-12-23.jar

      12) Check whether redis contains data

      13) Check the output of the Kafka consumer

Chapter 5 DWM layer: payment wide table

5.1 Requirements analysis and approach

  The main motivation for a payment wide table is that the payment table is not linked to the order details: the payment amount cannot be broken down to individual products, so product-level payment statistics are impossible. The core of this wide table is therefore to join the payment information with the order details.

  There are three possible solutions:

    1) Write the order detail table (or the order wide table) to HBase and query HBase when building the payment wide table; this effectively manages the order details as a dimension.

    2) Consume the order details as a stream and merge the two streams with a stream join. Because there is some delay between placing an order and paying for it, an intervalJoin must be used to control how long the stream state is kept, so that the order details are still in state when the payment arrives.

    3) Join the payment stream with the order wide table stream, which also saves the dimension lookups; this is the approach implemented below.

5.2 Implementation code

5.2.1 POJOs used

  1) Payment entity class

package com.yuange.flinkrealtime.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.math.BigDecimal;

/**
 * @作者:袁哥
 * @时间:2021/8/3 18:28
 */
@NoArgsConstructor
@AllArgsConstructor
@Data
public class PaymentInfo {
    private Long id;
    private Long order_id;
    private Long user_id;
    private BigDecimal total_amount;
    private String subject;
    private String payment_type;
    private String create_time;
    private String callback_time;
}

  2) Payment wide table entity class

package com.yuange.flinkrealtime.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.beanutils.BeanUtils;

import java.lang.reflect.InvocationTargetException;
import java.math.BigDecimal;

/**
 * @作者:袁哥
 * @时间:2021/8/3 18:29
 */
@NoArgsConstructor
@AllArgsConstructor
@Data
public class PaymentWide {
    private Long payment_id;
    private String subject;
    private String payment_type;
    private String payment_create_time;
    private String callback_time;
    private Long detail_id;
    private Long order_id;
    private Long sku_id;
    private BigDecimal order_price;
    private Long sku_num;
    private String sku_name;
    private Long province_id;
    private String order_status;
    private Long user_id;
    private BigDecimal total_amount;
    private BigDecimal activity_reduce_amount;
    private BigDecimal coupon_reduce_amount;
    private BigDecimal original_total_amount;
    private BigDecimal feight_fee;
    private BigDecimal split_feight_fee;
    private BigDecimal split_activity_amount;
    private BigDecimal split_coupon_amount;
    private BigDecimal split_total_amount;
    private String order_create_time;

    private String province_name;  // obtained from the dimension tables
    private String province_area_code;
    private String province_iso_code;
    private String province_3166_2_code;
    private Integer user_age;
    private String user_gender;

    private Long spu_id;     // dimension fields that need to be joined in
    private Long tm_id;
    private Long category3_id;
    private String spu_name;
    private String tm_name;
    private String category3_name;

    public PaymentWide(PaymentInfo paymentInfo, OrderWide orderWide) {
        mergeOrderWide(orderWide);
        mergePaymentInfo(paymentInfo);

    }

    public void mergePaymentInfo(PaymentInfo paymentInfo) {
        if (paymentInfo != null) {
            try {
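                // Apache Commons BeanUtils.copyProperties(dest, orig): fields are copied
                // from paymentInfo into this PaymentWide (the argument order is the reverse of Spring's)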
                BeanUtils.copyProperties(this, paymentInfo);
                payment_create_time = paymentInfo.getCreate_time();
                payment_id = paymentInfo.getId();
            } catch (IllegalAccessException | InvocationTargetException e) {
                e.printStackTrace();
            }
        }
    }

    public void mergeOrderWide(OrderWide orderWide) {
        if (orderWide != null) {
            try {
                BeanUtils.copyProperties(this, orderWide);
                order_create_time = orderWide.getCreate_time();
            } catch (IllegalAccessException | InvocationTargetException e) {
                e.printStackTrace();
            }
        }
    }
}

5.2.2 Join code

  1) Add constants to Constant

public static final String TOPIC_DWD_PAYMENT_INFO = "dwd_payment_info";
public static final String TOPIC_DWM_PAYMENT_WIDE = "dwm_payment_wide";

  2) Add a toTs() method to the YuangeCommonUtil utility class to convert a time String into a long epoch-millisecond timestamp

public static long toTs(String create_time) {
    try {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(create_time).getTime();
    } catch (ParseException e) {
        e.printStackTrace();
    }
    return 0L;
}
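
  A new SimpleDateFormat is created on every call, which sidesteps the well-known thread-safety problems of sharing a single SimpleDateFormat instance across Flink task threads; the method is used below as the event-time timestamp extractor for both input streams.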

  3) Create the DwmPaymentWideApp class

package com.yuange.flinkrealtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.yuange.flinkrealtime.app.BaseAppV2;
import com.yuange.flinkrealtime.bean.OrderWide;
import com.yuange.flinkrealtime.bean.PaymentInfo;
import com.yuange.flinkrealtime.bean.PaymentWide;
import com.yuange.flinkrealtime.common.Constant;
import com.yuange.flinkrealtime.util.FlinkSinkUtil;
import com.yuange.flinkrealtime.util.YuangeCommonUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.Map;

/**
 * @作者:袁哥
 * @时间:2021/8/3 18:31
 */
public class DwmPaymentWideApp extends BaseAppV2 {

    public static void main(String[] args) {
        new DwmPaymentWideApp().init(
                3004,
                1,
                "DwmPaymentWideApp",
                "DwmPaymentWideApp",
                Constant.TOPIC_DWD_PAYMENT_INFO, Constant.TOPIC_DWM_ORDER_WIDE
        );
    }

    @Override
    protected void run(StreamExecutionEnvironment environment, Map<String, DataStreamSource<String>> streams) {
        KeyedStream<PaymentInfo, Long> paymentInfoStream = streams.get(Constant.TOPIC_DWD_PAYMENT_INFO)
                .map(s -> JSON.parseObject(s, PaymentInfo.class))   // parse records from the dwd_payment_info topic into PaymentInfo objects
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<PaymentInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((element, recordTimestamp) -> YuangeCommonUtil.toTs(element.getCreate_time()))
                )
                .keyBy(PaymentInfo::getOrder_id);

        KeyedStream<OrderWide, Long> orderWideStream = streams.get(Constant.TOPIC_DWM_ORDER_WIDE)
                .map(s -> JSON.parseObject(s, OrderWide.class))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<OrderWide>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((element, recordTimestamp) -> YuangeCommonUtil.toTs(element.getCreate_time()))
                )
                .keyBy(OrderWide::getOrder_id);

        paymentInfoStream.intervalJoin(orderWideStream)
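                // interval join: match order-wide records whose event time is between 45 minutes before
                // and 10 seconds after the payment's event time (i.e. the payment is assumed to arrive
                // within 45 minutes of the order being created)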
                .between(Time.minutes(-45), Time.seconds(10))
                .process(new ProcessJoinFunction<PaymentInfo, OrderWide, PaymentWide>() {
                    @Override
                    public void processElement(PaymentInfo left,
                                               OrderWide right,
                                               Context ctx,
                                               Collector<PaymentWide> out) throws Exception {
                        out.collect(new PaymentWide(left,right));
                    }
                })
                .map(t->JSON.toJSONString(t))
                .addSink(FlinkSinkUtil.getKafkaSink(Constant.TOPIC_DWM_PAYMENT_WIDE));
    }
}

  4) The programs started earlier keep running; there is no need to stop them

  5) Package and upload to Linux

  6) Submit the job to the yarn-session and check that the dwm_payment_wide topic exists in Kafka and contains data

Chapter 6 Summary

  The main responsibility of the DWM-layer code is to transform one kind of detail record into another through computation, so that it can feed the statistics computed later. After this stage you should be able to:

  1) Use state for deduplication (requirement: UV computation)

  2) Use CEP to filter and evaluate a group of records (requirement: jump-out behavior computation)

  3) Use intervalJoin for stream joins

  4) Handle dimension association and optimize it with caching and asynchronous queries.
