[Clickhouse] Clickhouse 特性 : WITH FILL(缺失段数据补齐/自定义变量)
1 背景引入
背景需求: 缺失数据的补齐
- 要聚合查询一些数据,按照时间 15 分钟聚合数据量 或 根据用户提供的入参startTime/endTime动态筛选指定的若干月份聚合数据。但是由于某些时间段(如:某些月份)数据缺失,一些时段值没有,前端显示就会有问题,需要数据接口开发者自己插值解决
- 有一种实现方式: 在查询出库后,在服务端使用代码补全这些数据。
但是觉得这样有点 LOW,应该有更好的方法
- 经过一番调研,发现 Clickhouse 的
WITH FILL正符合需求
2 WITH FILL 简介
WITH FILL修饰符配合ORDER BY使用,就是为了按照指定顺序填值————有值跳过,无值插入- 可以在
ORDER BY expr之后用可选的FROM expr、TO expr和STEP expr参数来设置WITH FILL修饰符
from、to确定补值的范围,step是步进。若没有设置,则使用默认值
- 针对
DateTime类型,默认步进单位是 : 1秒- 针对
Date类型,默认步进单位是 : 1天- 针对
数字类型,默认步进单位是 : 1.0
- 所有
expr列的缺失值将被顺序填充,而其他列将被填充为默认值
使用以下语法填充多列,在ORDER BY部分的每个字段名称后添加带有可选参数的WITH FILL修饰符。
ORDER BY expr
[WITH FILL]
[FROM const_expr] [TO const_expr]
[STEP const_numeric_expr], ... exprN
[WITH FILL]
[FROM expr] [TO expr]
[STEP numeric_expr]
WITH FILL可以针对单字段使用,也可以多字段使用,但是仅适用于具有数字(所有类型的浮点,小数,整数)或日期/日期时间类型的字段- 当未定义
FROM const_expr填充顺序时,则使用ORDER BY中的最小expr字段值 - 当定义了
STEP const_numeric_expr时,不同类型(就3种)表示不一样:
- 对于数字类型,
const_numeric_expr为具体数值;- 当作为日期类型(Date),
const_numeric_expr为多少天;- 当作为日期时间(DateTime)类型时,
const_numeric_expr为多少秒。
- 如果省略了
STEP const_numeric_expr,则:填充顺序使用1.0表示数字类型,1 day表示日期类型,1 second表示日期时间类型。
3 应用场景 for 缺失段数据补齐
CASE: 单字段、基于数字、按月份逐一补0
with tmp_device_active_days as ( -- 设备每月活跃日数据集 (模拟从数据库查出的初始数据)
-- 动态计算指定月份的总天数 : toDayOfMonth(subtractDays(addMonths(toStartOfMonth( monthFirstDay ), 1), 1) ) as totalMonthDays
select 'XXXX001' as deviceId, toDate('2023-09-01') as monthFirstDay , 30 as totalMonthDays , 17 as realActiveMonthDays
union all
select 'XXXX001' as deviceId, toDate('2023-10-01') as monthFirstDay , 31 as totalMonthDays , 14 as realActiveMonthDays
union all
select 'XXXX002' as deviceId, toDate('2024-09-01') as monthFirstDay , 30 as totalMonthDays , 1 as realActiveMonthDays
union all
select 'XXXX003' as deviceId, toDate('2023-06-01') as monthFirstDay , 30 as totalMonthDays , 2 as realActiveMonthDays
union all
select 'XXXX004' as deviceId, toDate('2023-08-01') as monthFirstDay , 31 as totalMonthDays , 5 as realActiveMonthDays
union all
select 'XXXX005' as deviceId, toDate('2023-05-01') as monthFirstDay , 31 as totalMonthDays , 8 as realActiveMonthDays
union all
select 'XXXX006' as deviceId, toDate('2023-12-01') as monthFirstDay , 31 as totalMonthDays , 17 as realActiveMonthDays
union all
select 'XXXX007' as deviceId, toDate('2024-08-01') as monthFirstDay , 31 as totalMonthDays , 2 as realActiveMonthDays
union all
select 'XXXX008' as deviceId, toDate('2024-07-01') as monthFirstDay , 30 as totalMonthDays , 10 as realActiveMonthDays
union all
select 'XXXX009' as deviceId, toDate('2024-02-01') as monthFirstDay , 30 as totalMonthDays , 10 as realActiveMonthDays
)
select
formatDateTime( toStartOfMonth( addMonths( toDate('1970-01-01') , relativeMonth ) ) , '%Y%m' , 'Asia/Shanghai' ) as `date`
, toUInt64(0) as newDevices
, activeDevices
FROM (
SELECT
relativeMonth
, max(activeVehicles) as activeVehicles
from (
-- 构造 2 条起止时间的空数据,用于外层统计结果的自动填充
SELECT dateDiff('month', toDate('1970-01-01'), toStartOfMonth( toDate( toDateTime64('{{startTime}}', 3 ,'{{timeZone}}') ) ) ) as relativeMonth, 0 as activeDevices
UNION ALL
SELECT dateDiff('month', toDate('1970-01-01'), toStartOfMonth( toDate( toDateTime64('{{endTime}}', 3 ,'{{timeZone}}') ) ) ) as relativeMonth, 0 as activeDevices
UNION ALL
SELECT
relativeMonth
, count(deviceId) as activeDevices
FROM (
-- select
-- '' as deviceId
-- , toStartOfMonth( toDate( toDateTime64('{{startTime}}', 3 ,'{{timeZone}}') ) ) as `monthFirstDay`
-- , toDayOfMonth(subtractDays(addMonths(toStartOfMonth( monthFirstDay ), 1), 1) ) as totalMonthDays
-- , 0 as realActiveMonthDays
-- , dateDiff('month', toDate('1970-01-01'), monthFirstDay ) as relativeMonth
-- union all
-- select
-- '' as deviceId
-- , toStartOfMonth( toDate( toDateTime64('{{endTime}}', 3 ,'{{timeZone}}') ) ) as `monthFirstDay`
-- , toDayOfMonth(subtractDays(addMonths(toStartOfMonth( monthFirstDay ), 1), 1) ) as totalMonthDays
-- , 0 as realActiveMonthDays
-- , dateDiff('month', toDate('1970-01-01'), monthFirstDay ) as relativeMonth
-- union all
SELECT
*
, dateDiff('month', toDate('1970-01-01'), monthFirstDay ) as relativeMonth
FROM tmp_device_active_days
-- where 1 = 1 and ( (realActiveMonthDays > 5 ) or (deviceId = '') )
where 1 = 1 and (realActiveMonthDays > 5 ) -- 每月活跃天数 > 5 天的设备 -- realActiveMonthDays = totalMonthDays
) y
group by relativeMonth
) x
group by relativeMonth
order by relativeMonth ASC
WITH FILL
-- dateDiff 计算结果是 数字(距 1970-01-01 的月份数,以此实现对缺数据月份的自动补齐)
FROM dateDiff('month', toDate('1970-01-01') , toDate( toDateTime64('{{startTime}}', 3 ,'{{timeZone}}') ) ) TO dateDiff('month', toDate('1970-01-01'), toDate( toDateTime64('{{endTime}}', 3 ,'{{timeZone}}') ) )
STEP 1
)
- 用户传入的动态参数
startTime, 时间戳、单位: 毫秒(13位)。如: 1684857600000endTime, 时间戳、单位: 毫秒(13位)。如 : 1795462399999timeZone, 时区,如:"Asia/Shanghai"
- output : 不加 WITH FILL 时
-- order by relativeMonth ASC
-- WITH FILL
--
-- FROM dateDiff('month', toDate('1970-01-01') , toDate( toDateTime64('1684857600000', 3 ,'Asia/Shanghai') ) ) TO dateDiff('month', toDate('1970-01-01'), toDate( toDateTime64('1795462399999', 3 ,'Asia/Shanghai') ) )
-- STEP 1
date |newDevices|activeDevices|
------+----------+-------------+
202407| 0| 1|
202402| 0| 1|
202305| 0| 1|
202312| 0| 1|
202310| 0| 1|
202309| 0| 1|
- output : 加 WITH FILL 时
date |newDevices|activeDevices|
------+----------+-------------+
202305| 0| 1|
202306| 0| 0|
202307| 0| 0|
202308| 0| 0|
202309| 0| 1|
202310| 0| 1|
202311| 0| 0|
202312| 0| 1|
202401| 0| 0|
202402| 0| 1|
202403| 0| 0|
202404| 0| 0|
202405| 0| 0|
202406| 0| 0|
202407| 0| 1|
202408| 0| 0|
202409| 0| 0|
202410| 0| 0|
202411| 0| 0|
202412| 0| 0|
202501| 0| 0|
202502| 0| 0|
202503| 0| 0|
202504| 0| 0|
202505| 0| 0|
202506| 0| 0|
202507| 0| 0|
202508| 0| 0|
202509| 0| 0|
202510| 0| 0|
202511| 0| 0|
202512| 0| 0|
202601| 0| 0|
202602| 0| 0|
202603| 0| 0|
202604| 0| 0|
202605| 0| 0|
202606| 0| 0|
202607| 0| 0|
202608| 0| 0|
202609| 0| 0|
202610| 0| 0|
CASE : 单字段、基于 DateTime 、按N分钟间隔时段补0
- 此处是对表passing_vehicle 进行数据统计,把一天按照15分钟间隔,获取96条过车数据
- toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2是对时间戳进行聚合,按照15分钟间隔
- GROUP by time_stamp2,approach是按照15分钟和进口聚合
- 一开始的SQL是 GROUP by time_stamp2,approach order by time_stamp2,approach,查询条件也没有and approach = 'SB',结果插值补全有问题,多字段聚合排序,只补值一个字段,得到的不是自己的想要的结果
SELECT
toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2 ,
approach,lane_nbr,
count() as totalVolume
from passing_vehicle pv
WHERE
intersection_number = 1687001
and time_stamp > '2023-05-08 00:00:00'
and time_stamp < '2023-05-09 00:00:00'
and approach = 'SB'
and status = 1
GROUP by time_stamp2
order by time_stamp2
WITH FILL
FROM toDateTime('2023-05-08 00:00:00') TO toDateTime('2023-05-09 00:00:00')
STEP 15*60
- 现在这条SQL能正常运行和补值,它查询的是,某路口南进口(SB)的15分钟过车流量统计
- FROM toDateTime('2023-05-08 00:00:00') TO toDateTime('2023-05-09 00:00:00') 是时间范围,这个和我的查询时间是对应的,这个范围内补全。注意这个类型,一定要是时间日期,即使用toDateTime函数,将字符串转成日期时间类型
- STEP 15*60,是将插值补全,步进为15分钟,日期时间类型,默认是1s,15分钟乘以对应秒数即可
注: 如果是要按照一天(或n天)去做时间聚合统计,就需要使用
toDate函数,相应步进为n天
CASE : 多字段补全
- 进阶,按照进口查询多次,多个字段补全
SELECT
toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2,
lane_nbr,
sum(status) as totalVolume
from passing_vehicle pv
WHERE 1 = 1
and intersection_number = 1687001
and time_stamp >= '2023-05-08 00:00:00'
and time_stamp <= '2023-05-08 01:59:59'
and approach = 'NB'
GROUP by time_stamp2, lane_nbr
order by time_stamp2
WITH FILL
FROM toDateTime('2023-05-08 00:00:00') TO toDateTime('2023-05-08 01:59:59')
STEP 15 * 60,
-- 注意:这里的6,是传参进来的,因为是5车道,lane_nbr需要补全1-5(最大是5,范围要是6才行)
lane_nbr
WITH FILL
FROM 1 TO 6
STEP 1
- 这条
SQL查询的是,某路口北进口(NB)各个车道15分钟过车数量统计 GROUP by time_stamp2,lane_nbr,聚合条件是15分钟,和车道编号time_stamp2 WITH FILL与上面一样,按照15分钟补全lane_nbr WITH FILL FROM 1 TO 6 STEP 1是按照车道编号补全,注意to的值要比最大值大1- 这2个
WITH FILL一起使用,就会把时间和车道编号都补全,得到我们想要的结果 - 对于非数值/日期/日期时间类型的字段,如果是枚举类型字符串类型字段,也可以使用补全,将该字段的枚举值,存成连续新增的数值即可
4 应用场景 for WITH的其他用处
CASE 自定义变量
-- 查询离线天数在 20-30 天的
WITH
-- 动态定义变量
splitByString('-', '20-30')[1] AS offlineDaysStart1,
splitByString('-', '20-30')[2] AS offlineDaysEnd1,
splitByString('-', '35-40')[1] AS offlineDaysStart2,
splitByString('-', '35-40')[2] AS offlineDaysEnd2
select
*
from (
select 19 as offlineDays, 'jack' as name
union all
select 29 as offlineDays, 'jane' as name
union all
select 21 as offlineDays, 'mike' as name
union all
select 34 as offlineDays, 'bill' as name
union all
select 35 as offlineDays, 'bill' as name
)
where 1=1
and ( offlineDays >= offlineDaysStart1 and offlineDays < offlineDaysEnd1)
or ( offlineDays >= offlineDaysStart2 and offlineDays < offlineDaysEnd2 )
out
offlineDays|name|
-----------+----+
29|jane|
21|mike|
35|bill|
X 参考文献
- clickhouse
本文作者:
千千寰宇
本文链接: https://www.cnblogs.com/johnnyzen
关于博文:评论和私信会在第一时间回复,或直接私信我。
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!
日常交流:大数据与软件开发-QQ交流群: 774386015 【入群二维码】参见左下角。您的支持、鼓励是博主技术写作的重要动力!
本文链接: https://www.cnblogs.com/johnnyzen
关于博文:评论和私信会在第一时间回复,或直接私信我。
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!
日常交流:大数据与软件开发-QQ交流群: 774386015 【入群二维码】参见左下角。您的支持、鼓励是博主技术写作的重要动力!

浙公网安备 33010602011771号