拉链表及实现

拉链表定义:
存储的是用户的最基本信息以及每条记录的生命周期(每日增量的被修改的数据)。我们可以使用这张表拿到当天的最新数据以及之前的历史数据。

如果不用拉链表,储存全量数据的方法有:
1、每天抽取只留最新的一份数据,则丢失了历史数据中,修改的信息;
2、每天保留一份全量的切片数据,每次全量中会保存很多不变的信息,对存储是极大的浪费

hive中设计拉链表:
数据表ods.user + 数据更新表ods.user_update +拉链表初始化第一天的数据dws.user_his + t_start_date + t_end_date
现在假设初始化好了user_his后,对其更新2017-01-02的数据进来:

CREATE EXTERNAL TABLE dws.user_his (
user_num STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
reg_date STRING COMMENT '用户编号',
t_start_date ,
t_end_date
COMMENT '用户资料拉链表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/dws/user_his';
)
INSERT OVERWRITE TABLE dws.user_his
SELECT * FROM
(
SELECT A.user_num,
A.mobile,
A.reg_date,
A.t_start_time,
CASE
WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
ELSE A.t_end_time
END AS t_end_time
FROM dws.user_his AS A
LEFT JOIN ods.user_update AS B
ON A.user_num = B.user_num
UNION
SELECT C.user_num,
C.mobile,
C.reg_date,
'2017-01-02' AS t_start_time,
'9999-12-31' AS t_end_time
FROM ods.user_update AS C
) AS T

posted @ 2021-06-29 21:09  renana  阅读(965)  评论(0)    收藏  举报