dbt dbt-audit-helper 包提供的一些方便macro

dbt-audit-helper 从字面意思是dbt 的审计帮助工具,但是实际上我们也可以使用此工具做一些数据质量相关的东西

dbt-audit-helper 提供的macro

  • 比较数据输出
    包含了compare_relations,compare_queries,compare_row_counts
  • 比较列
    compare_column_values, cmpare_all_columns, compare_which_columns_differ, compare_relation_columns
  • 以及打印输出到日志&输出到自定义的测试中

参考使用

  • compare_queries
{% set old_query %}
  select
    id as order_id,
    amount,
    customer_id
  from old_database.old_schema.fct_orders
{% endset %}
 
{% set new_query %}
  select
    order_id,
    amount,
    customer_id
  from {{ ref('fct_orders') }}
{% endset %}
 
{{ audit_helper.compare_queries(
    a_query = old_query,
    b_query = new_query,
    primary_key = "order_id"
) }}
  • 输出结果
{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}
 
{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}
 
{% set audit_query = audit_helper.compare_column_values(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    column_to_compare = "status"
) %}
 
{% set audit_results = run_query(audit_query) %}
 
{% if execute %}
{% do audit_results.print_table() %}
{% endif %}
  • 集成测试的
{% set old_relation = ref('stg_customers') %}
 
{% set dbt_relation = ref('customers') %}
 
{{ audit_helper.compare_relations(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key = "customer_id"
) }}
 
where percent_of_total < 100
  • 测试存储其他地方
{{config(schema='test_appv2')}}
 
{% set old_relation = ref('stg_customers') %}
 
{% set dbt_relation = ref('customers') %}
 
{{ audit_helper.compare_relations(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key = "customer_id"
)}}
where percent_of_total = 100

内部实现简单说明

只说明一个简单的compare_queries

  • compare_queries 内部实现

从处理上核心是对于sql 查询进行数据集的交叉并处理,通过分析包含的相同数据,之后提供结果信息的汇总

{% macro compare_queries(a_query, b_query, primary_key=None, summarize=true, limit=None) -%}
  {{ return(adapter.dispatch('compare_queries', 'audit_helper')(a_query, b_query, primary_key, summarize, limit)) }}
{%- endmacro %}
 
{% macro default__compare_queries(a_query, b_query, primary_key=None, summarize=true, limit=None) %}
 
with a as (
 
    {{ a_query }}
 
),
 
b as (
 
    {{ b_query }}
 
),
 
a_intersect_b as (
 
    select * from a
    {{ dbt.intersect() }}
    select * from b
 
),
 
a_except_b as (
 
    select * from a
    {{ dbt.except() }}
    select * from b
 
),
 
b_except_a as (
 
    select * from b
    {{ dbt.except() }}
    select * from a
 
),
 
all_records as (
 
    select
        *,
        true as in_a,
        true as in_b
    from a_intersect_b
 
    union all
 
    select
        *,
        true as in_a,
        false as in_b
    from a_except_b
 
    union all
 
    select
        *,
        false as in_a,
        true as in_b
    from b_except_a
 
),
 
{%- if summarize %}
 
summary_stats as (
 
    select
 
        in_a,
        in_b,
        count(*) as count
 
    from all_records
    group by 1, 2
 
),
 
final as (
 
    select
 
        *,
        round(100.0 * count / sum(count) over (), 2) as percent_of_total
 
    from summary_stats
    order by in_a desc, in_b desc
 
)
 
{%- else %}
 
final as (
 
    select * from all_records
    where not (in_a and in_b)
    order by {{ primary_key ~ ", " if primary_key is not none }} in_a desc, in_b desc
 
)
 
{%- endif %}
 
select * from final
{%- if limit and not summarize %}
limit {{ limit }}
{%- endif %}
 
 
{% endmacro %}

说明

dbt-audit-helper 是一个很不错的工具包,可以实现不错的模型check,提示数据质量以及做一些模式审计处理

参考资料

https://github.com/dbt-labs/dbt-audit-helper

posted on 2024-04-30 12:08  荣锋亮  阅读(4)  评论(0编辑  收藏  举报

导航