工作中实现了一张表的设计并且数据也已经跑出来了,接下来怎么对这张表做数据质量的监控,来防止数据丢失/错误的出现?
下面列几个常用的例子:
1.DQ:PK duplicate check, pk=slr_id+dt
with t as (
select slr_id, dt, count(1)
from slr_base_dtl
group by 1,2
)
select assert_true(count(1) = 0, 'pk duplicate')
from t
where min_dt <> '2025-01-01';
2. DQ: 假设对每个seller的起始时间是2025-01-01,for each seller, min dt should be 2025-01-01
with t as (
select slr_id, min(dt) as min_dt
from slr_base_dtl
group by 1
)
select assert_true(count(1) = 0, 'There are sellers with min dt not equal to 2023-06-01')
from t
where min_dt <> '2025-01-01';
3.DQ: 假设每个seller有最近两年每天的明细数据,for each seller, distinct dt should be max(dt) - min(dt)+1
with t as (
select slr_id, min(dt) as min_dt, max(dt) as max_dt, count(DISTINCT dt) as cnt
from slr_base_dtl
group by 1
)
select assert_true(count(1) = 0, 'There are sellers with distinct dt not equal to max(dt) - min(dt)')
from t
where cnt <> datediff(max_dt, min_dt) + 1;
浙公网安备 33010602011771号