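A brief look at the dbt-checkpoint source structure

As mentioned earlier, dbt-checkpoint parses dbt metadata and then checks it against a set of rules. It is packaged as a pre-commit plugin. Below is a brief look at how it is implemented internally.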

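Configuration

The core is the .pre-commit-hooks.yaml file, a standard pre-commit hook definition.

  • Contents

The key fields are id, name, entry and language. The entry is simply the name of a console_scripts command declared in the Python package's entry_points.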

- id: check-column-desc-are-same
  name: Check column descriptions are same
  description: Check the models have same descriptions for same column names.
  entry: check-column-desc-are-same
  language: python
  files: '.*\.(yml|yaml)$'
- id: check-column-name-contract
  name: Check column name contract
  description: Check column name abides to contract.
  entry: check-column-name-contract
  language: python
  types_or: [sql, yaml]
- id: check-macro-has-description
  name: Check the macro has description
  description: Ensures that the macro has description in properties file.
  entry: check-macro-has-description
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-macro-arguments-have-desc
  name: Check the macro arguments have description
  description: Ensures that the macro has arguments with descriptions in properties file.
  entry: check-macro-arguments-have-desc
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-columns-have-desc
  name: Check the model columns have description
  description: Ensures that the model has columns with descriptions in properties file.
  entry: check-model-columns-have-desc
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-has-all-columns
  name: Check the model has all columns in properties file
  description: Ensures that all columns in database are specified in properties file.
  entry: check-model-has-all-columns
  language: python
  types_or: [sql, yaml]
- id: check-model-has-description
  name: Check the model has description
  description: Ensures that the model has description in properties file.
  entry: check-model-has-description
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process yaml and sql
- id: check-model-has-meta-keys
  name: Check the model has keys in the meta part
  description: Ensures that the model has a list of valid meta keys.
  entry: check-model-has-meta-keys
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process YAML and SQL
- id: check-model-has-labels-keys
  name: Check the model has keys in the labels part
  description: Ensures that the model has a list of valid labels keys.
  entry: check-model-has-labels-keys
  language: python
  types_or: [yaml, sql]
  require_serial: true # because we need to process YAML and SQL
- id: check-model-has-properties-file
  name: Check the model has properties file
  description: Ensures that the model has properties file (schema file).
  entry: check-model-has-properties-file
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-group
  name: Check model; number of tests from group of tests.
  description: Ensures that the model has a number of tests from a group of tests.
  entry: check-model-has-tests-by-group
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-name
  name: Check the model tests by a test name
  description: Ensures that the model has a number of tests of a certain name (e.g. data, unique).
  entry: check-model-has-tests-by-name
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests-by-type
  name: Check the model tests by a test type
  description: Ensures that the model has a number of tests of a certain type (data, schema).
  entry: check-model-has-tests-by-type
  language: python
  types_or: [sql, yaml]
- id: check-model-has-tests
  name: Check that model has tests
  description: Ensures that the model has a number of tests.
  entry: check-model-has-tests
  language: python
  types_or: [sql, yaml]
- id: check-model-name-contract
  name: Check model name contract
  description: Check model name abides to contract.
  entry: check-model-name-contract
  language: python
  types_or: [sql]
- id: check-model-parents-schema
  name: Check parent models/sources from certain schema
  entry: check-model-parents-schema
  language: python
  types_or: [yaml, sql]
- id: check-model-parents-database
  name: Check parent models/sources from certain database
  entry: check-model-parents-database
  language: python
  types_or: [sql, yaml]
- id: check-model-parents-and-childs
  name: Check the model has a parents/childs
  description: Ensures the model has a specific number (max/min) of parents or/and childs.
  entry: check-model-parents-and-childs
  language: python
  types_or: [sql, yaml]
- id: check-model-tags
  name: Check the model has valid tags
  description: Ensures that the model has only valid tags from the provided list.
  entry: check-model-tags
  language: python
  types_or: [sql, yaml]
- id: check-script-has-no-table-name
  name: Check the script has not table name
  description: Ensures that the script is using only source or ref macro to specify the table name.
  entry: check-script-has-no-table-name
  language: python
  types_or: [sql]
- id: check-script-ref-and-source
  name: Check the script has existing refs and sources
  description: Ensures that the script contains only existing sources or macros.
  entry: check-script-ref-and-source
  language: python
  types_or: [sql]
- id: check-script-semicolon
  name: Check the script does not contain a semicolon
  description: Ensure that the script does not have a semicolon at the end of the file.
  entry: check-script-semicolon
  language: python
  types_or: [sql]
- id: check-source-childs
  name: Check the source has max/min number of childs.
  description: Ensures the source has a specific number (max/min) of childs.
  entry: check-source-childs
  language: python
  types_or: [sql]
- id: check-source-columns-have-desc
  name: Check for source column descriptions
  description: Ensures that the source has columns with descriptions in the properties file.
  entry: check-source-columns-have-desc
  language: python
  types_or: [yaml]
- id: check-source-has-all-columns
  name: Check source has all columns in properties file
  description: Ensures that all columns in the database are specified in the properties file.
  entry: check-source-has-all-columns
  language: python
  types_or: [yaml]
- id: check-source-table-has-description
  name: Check the source table has description
  description: Ensures that the source table has description in properties file.
  entry: check-source-table-has-description
  language: python
  types_or: [yaml]
- id: check-source-has-freshness
  name: Check the source has the freshness
  description: Ensures that the source has freshness options.
  entry: check-source-has-freshness
  language: python
  types_or: [yaml]
- id: check-source-has-loader
  name: Check the source has loader option
  description: Ensures that the source has loader option.
  entry: check-source-has-loader
  language: python
  types_or: [yaml]
- id: check-source-has-meta-keys
  name: Check the source has keys in the meta part
  description: Ensures that the source has a list of valid meta keys.
  entry: check-source-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-source-has-labels-keys
  name: Check the source has keys in the labels part
  description: Ensures that the source has a list of valid labels keys.
  entry: check-source-has-labels-keys
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-name
  name: Check the source tests by test name
  description: Ensures that the source has a number of tests of a certain name (e.g. data, unique).
  entry: check-source-has-tests-by-name
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-type
  name: Check the source tests by test type
  description: Ensures that the source has a number of tests of a certain type (data, schema).
  entry: check-source-has-tests-by-type
  language: python
  types_or: [yaml]
- id: check-source-has-tests-by-group
  name: Check the source tests by test group
  description: Ensures that the source has a number of tests of a certain group (unique, unique-combination-of-columns).
  entry: check-source-has-tests-by-group
  language: python
  types_or: [yaml]
- id: check-source-has-tests
  name: Check the source has tests
  description: Ensures that the source has a number of tests.
  entry: check-source-has-tests
  language: python
  types_or: [yaml]
- id: check-source-tags
  name: Check the source has valid tags
  description: Ensures that the source has only valid tags from the provided list.
  entry: check-source-tags
  language: python
  types_or: [yaml]
- id: dbt-clean
  name: dbt clean
  description: Deletes all folders specified in the clean-targets.
  entry: dbt-clean
  language: python
  pass_filenames: false
- id: dbt-compile
  name: dbt compile
  description: Generates executable SQL from source model, test, and analysis files.
  entry: dbt-compile
  language: python
  types_or: [sql]
  require_serial: true
- id: dbt-deps
  name: dbt deps
  description: Pulls the most recent version of the dependencies listed in your packages.yml.
  entry: dbt-deps
  language: python
  pass_filenames: false
- id: dbt-docs-generate
  name: dbt docs generate
  description: The command is responsible for generating your project's documentation website.
  entry: dbt-docs-generate
  language: python
  pass_filenames: false
- id: dbt-parse
  name: dbt parse
  description: Generates manifest.json from source model, test, and analysis files.
  entry: dbt-parse
  language: python
  types_or: [sql]
  require_serial: true
- id: dbt-run
  name: dbt run
  description: Executes compiled sql model files.
  entry: dbt-run
  language: python
  require_serial: true
  types_or: [sql]
- id: dbt-test
  name: dbt test
  description: Runs tests on data in deployed models.
  entry: dbt-test
  language: python
  require_serial: true
  types_or: [sql]
- id: generate-missing-sources
  name: Generate missing sources
  description: If any source is missing this hook tries to create it.
  entry: generate-missing-sources
  language: python
  types_or: [sql]
- id: generate-model-properties-file
  name: Generate model properties file
  description: Generate model properties file if does not exists.
  entry: generate-model-properties-file
  language: python
  types_or: [sql]
  args:
    [
      "--properties-file",
      "/Users/tomsejr/Documents/03-Workspace/Private/jaffle_shop/{database}/{schema}/{name}.yml",
    ]
  require_serial: true
- id: unify-column-description
  name: Unify column description
  description: Unify column descriptions across all models
  entry: unify-column-description
  language: python
  files: '.*\.(yml|yaml)$'
  require_serial: true
- id: replace-script-table-names
  name: Replace script table names
  description: Replace table names with source or ref macros in the script.
  entry: replace-script-table-names
  language: python
  types_or: [sql]
- id: remove-script-semicolon
  name: Remove script semicolon
  description: Remove semicolon at the end of the script.
  entry: remove-script-semicolon
  language: python
  types_or: [sql]
- id: check-model-materialization-by-childs
  name: Check the materialization of the model by childs
  description: Controls the materialization of the model by its number of childs.
  entry: check-model-materialization-by-childs
  language: python
  types_or: [sql]
- id: check-exposure-has-meta-keys
  name: Check the exposure has keys in the meta part
  description: Ensures that the exposure has a list of valid meta keys.
  entry: check-exposure-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-macro-has-meta-keys
  name: Check the macro has keys in the meta part
  description: Ensures that the macro has a list of valid meta keys.
  entry: check-macro-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-seed-has-meta-keys
  name: Check the seed has keys in the meta part
  description: Ensures that the seed has a list of valid meta keys.
  entry: check-seed-has-meta-keys
  language: python
  types_or: [yaml]
- id: check-snapshot-has-meta-keys
  name: Check the snapshot has keys in the meta part
  description: Ensures that the snapshot has a list of valid meta keys.
  entry: check-snapshot-has-meta-keys
  language: python
  types_or: [sql, yaml]
- id: check-test-has-meta-keys
  name: Check the test has keys in the meta part
  description: Ensures that the test has a list of valid meta keys.
  entry: check-test-has-meta-keys
  language: python
  types_or: [sql]

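Entry implementation

Here is a simple example. Each entry parses the metadata and evaluates it against a rule. To make this easier, the project has a helper module, utils.py, that defines the types and handles the metadata parsing.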

  • check_macro_has_description.py
import argparse
import os
import time
from typing import Any, Dict, Optional, Sequence
 
from dbt_checkpoint.tracking import dbtCheckpointTracking
from dbt_checkpoint.utils import (
    JsonOpenError,
    add_default_args,
    get_dbt_manifest,
    get_filenames,
    get_macro_schemas,
    get_macro_sqls,
    get_macros,
    get_missing_file_paths,
    red,
)
 
# Rule check based on the dbt metadata (manifest)
def has_description(
    paths: Sequence[str], manifest: Dict[str, Any], exclude_pattern: str
) -> Dict[str, Any]:
    paths = get_missing_file_paths(paths, manifest, exclude_pattern=exclude_pattern)
    status_code = 0
    ymls = get_filenames(paths, [".yml", ".yaml"])
    sqls = get_macro_sqls(paths, manifest)
    filenames = set(sqls.keys())
 
    # get manifest macros that pre-commit found as changed
    macros = get_macros(manifest, filenames)
    # if user added schema but did not rerun the macro
    schemas = get_macro_schemas(list(ymls.values()), filenames)
    # convert to sets
    in_macros = {macro.filename for macro in macros if macro.macro.get("description")}
    in_schemas = {
        schema.macro_name for schema in schemas if schema.schema.get("description")
    }
    missing = filenames.difference(in_macros, in_schemas)
 
    for macro in missing:
        status_code = 1
        print(
            f"{red(sqls.get(macro))}: "
            f"does not have defined description or properties file is missing.",
        )
    return {"status_code": status_code}
 
 
def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser()
    add_default_args(parser)
 
    args = parser.parse_args(argv)
 
    try:
        manifest = get_dbt_manifest(args)
    except JsonOpenError as e:
        print(f"Unable to load manifest file ({e})")
        return 1
 
    start_time = time.time()
    hook_properties = has_description(
        paths=args.filenames, manifest=manifest, exclude_pattern=args.exclude
    )
    end_time = time.time()
    script_args = vars(args)
 
    tracker = dbtCheckpointTracking(script_args=script_args)
    tracker.track_hook_event(
        event_name="Hook Executed",
        manifest=manifest,
        event_properties={
            "hook_name": os.path.basename(__file__),
            "description": "Check the macro has description.",
            "status": hook_properties.get("status_code"),
            "execution_time": end_time - start_time,
            "is_pytest": script_args.get("is_test"),
        },
    )
 
    return hook_properties.get("status_code")
 
 
if __name__ == "__main__":
    exit(main())
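
The heart of has_description above is a set difference: a macro passes if a description is found either in the manifest or in a properties file. The following is a minimal sketch of that step only, with plain sets standing in for the objects returned by the utils helpers. The function name find_missing and the sample macro names are made up for illustration.

```python
from typing import Set


def find_missing(filenames: Set[str], in_macros: Set[str], in_schemas: Set[str]) -> Set[str]:
    """Return the macros that have no description in either place."""
    return filenames.difference(in_macros, in_schemas)


# Hypothetical sample data: three changed macros, two of them documented.
filenames = {"cents_to_dollars", "generate_key", "limit_rows"}
in_macros = {"cents_to_dollars"}   # description already in the manifest
in_schemas = {"generate_key"}      # description only in a properties file

missing = find_missing(filenames, in_macros, in_schemas)
status_code = 1 if missing else 0  # a non-zero exit code makes pre-commit fail the hook
print(sorted(missing), status_code)
```

Passing both sets to difference in one call removes anything documented in either source, so only macros missing from both are reported.
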
  • console_scripts definitions under entry_points
[options.entry_points]
console_scripts =
    check-column-desc-are-same = dbt_checkpoint.check_column_desc_are_same:main
    check-column-name-contract = dbt_checkpoint.check_column_name_contract:main
    check-macro-has-description = dbt_checkpoint.check_macro_has_description:main
    check-macro-arguments-have-desc = dbt_checkpoint.check_macro_arguments_have_desc:main
    check-model-columns-have-desc = dbt_checkpoint.check_model_columns_have_desc:main
    check-model-has-all-columns = dbt_checkpoint.check_model_has_all_columns:main
    check-model-has-description = dbt_checkpoint.check_model_has_description:main
    check-model-has-meta-keys = dbt_checkpoint.check_model_has_meta_keys:main
    check-model-has-labels-keys = dbt_checkpoint.check_model_has_labels_keys:main
    check-model-has-properties-file = dbt_checkpoint.check_model_has_properties_file:main
    check-model-has-tests-by-name = dbt_checkpoint.check_model_has_tests_by_name:main
    check-model-has-tests-by-type = dbt_checkpoint.check_model_has_tests_by_type:main
    check-model-has-tests-by-group = dbt_checkpoint.check_model_has_tests_by_group:main
    check-model-has-tests = dbt_checkpoint.check_model_has_tests:main
    check-model-name-contract = dbt_checkpoint.check_model_name_contract:main
    check-model-parents-and-childs = dbt_checkpoint.check_model_parents_and_childs:main
    check-model-parents-database = dbt_checkpoint.check_model_parents_database:main
    check-model-parents-schema = dbt_checkpoint.check_model_parents_schema:main
    check-model-tags = dbt_checkpoint.check_model_tags:main
    check-script-has-no-table-name = dbt_checkpoint.check_script_has_no_table_name:main
    check-script-ref-and-source = dbt_checkpoint.check_script_ref_and_source:main
    check-script-semicolon = dbt_checkpoint.check_script_semicolon:main
    check-source-childs = dbt_checkpoint.check_source_childs:main
    check-source-columns-have-desc = dbt_checkpoint.check_source_columns_have_desc:main
    check-source-has-all-columns = dbt_checkpoint.check_source_has_all_columns:main
    check-source-table-has-description = dbt_checkpoint.check_source_table_has_description:main
    check-source-has-freshness = dbt_checkpoint.check_source_has_freshness:main
    check-source-has-loader = dbt_checkpoint.check_source_has_loader:main
    check-source-has-meta-keys = dbt_checkpoint.check_source_has_meta_keys:main
    check-source-has-labels-keys = dbt_checkpoint.check_source_has_labels_keys:main
    check-source-has-tests-by-group = dbt_checkpoint.check_source_has_tests_by_group:main
    check-source-has-tests-by-name = dbt_checkpoint.check_source_has_tests_by_name:main
    check-source-has-tests-by-type = dbt_checkpoint.check_source_has_tests_by_type:main
    check-source-has-tests = dbt_checkpoint.check_source_has_tests:main
    check-source-tags = dbt_checkpoint.check_source_tags:main
    check-model-materialization-by-childs = dbt_checkpoint.check_model_materialization_by_childs:main
    dbt-clean = dbt_checkpoint.dbt_clean:main
    dbt-compile = dbt_checkpoint.dbt_compile:main
    dbt-deps = dbt_checkpoint.dbt_deps:main
    dbt-docs-generate = dbt_checkpoint.dbt_docs_generate:main
    dbt-parse = dbt_checkpoint.dbt_parse:main
    dbt-run = dbt_checkpoint.dbt_run:main
    dbt-test = dbt_checkpoint.dbt_test:main
    generate-missing-sources = dbt_checkpoint.generate_missing_sources:main
    generate-model-properties-file = dbt_checkpoint.generate_model_properties_file:main
    unify-column-description = dbt_checkpoint.unify_column_description:main
    replace-script-table-names = dbt_checkpoint.replace_script_table_names:main
    remove-script-semicolon = dbt_checkpoint.remove_script_semicolon:main
    check-snapshot-has-meta-keys = dbt_checkpoint.check_snapshot_has_meta_keys:main
    check-exposure-has-meta-keys = dbt_checkpoint.check_exposure_has_meta_keys:main
    check-macro-has-meta-keys = dbt_checkpoint.check_macro_has_meta_keys:main
    check-seed-has-meta-keys = dbt_checkpoint.check_seed_has_meta_keys:main
    check-test-has-meta-keys = dbt_checkpoint.check_test_has_meta_keys:main
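
Each line above maps a command name to a module and a function in the form name = module:function. As a rough illustration of what that mapping means (this is not how the packaging tools implement it), the sketch below splits one such line and resolves the target with importlib. The helper names parse_console_script and load_target are hypothetical, and os.path:basename stands in for a dbt_checkpoint module so the example runs without the package installed.

```python
import importlib
from typing import Callable, Tuple


def parse_console_script(line: str) -> Tuple[str, str, str]:
    """Split 'name = package.module:func' into (name, module, func)."""
    name, target = (part.strip() for part in line.split("=", 1))
    module, func = target.split(":", 1)
    return name, module, func


def load_target(module: str, func: str) -> Callable:
    """Import the module and return the named callable."""
    return getattr(importlib.import_module(module), func)


name, module, func = parse_console_script(
    "check-macro-has-description = dbt_checkpoint.check_macro_has_description:main"
)
print(name, module, func)

# Stand-in target from the standard library, resolved the same way.
basename = load_target("os.path", "basename")
print(basename("/tmp/check_macro_has_description.py"))
```
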

Usage

Usage follows the standard pre-commit workflow.

  • Define .pre-commit-config.yaml

It lists the hooks to run.

repos:
- repo: https://github.com/dbt-checkpoint/dbt-checkpoint
  rev: v1.2.1
  hooks:
  - id: dbt-parse
  - id: dbt-docs-generate
    args: ["--cmd-flags", "++no-compile"]
  - id: check-script-semicolon
  - id: check-script-has-no-table-name
  - id: check-model-has-all-columns
    name: Check columns - core
    files: ^models/core
  - id: check-model-has-all-columns
    name: Check columns - mart
    files: ^models/mart
  - id: check-model-columns-have-desc
    files: ^models/mart
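
The files key in the config above is a regular expression that pre-commit matches against each file path to decide which files are passed to a hook. A rough sketch of that filtering follows. It is not pre-commit's actual code, and select_files and the sample paths are hypothetical.

```python
import re
from typing import List, Sequence


def select_files(paths: Sequence[str], files_pattern: str) -> List[str]:
    """Keep the paths where the pattern is found (search, not full match)."""
    pattern = re.compile(files_pattern)
    return [p for p in paths if pattern.search(p)]


# Hypothetical list of changed files in a dbt project.
changed = [
    "models/core/customers.sql",
    "models/mart/orders.sql",
    "models/mart/orders.yml",
    "macros/cents_to_dollars.sql",
]

print(select_files(changed, r"^models/mart"))
print(select_files(changed, r".*\.(yml|yaml)$"))
```

This is why the same hook id can be listed twice with different files values: each entry sees a different subset of the changed files.
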
  • Run

You can install it as a git pre-commit hook (pre-commit install) so that the checks run on git commit, or run it directly with pre-commit run --all-files.

Notes

The core of dbt-checkpoint is parsing dbt metadata, but it also wraps several dbt Core CLI commands, such as run, compile, deps, clean, test, docs generate and parse. It is worth trying out.
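
A wrapper hook of this kind boils down to building a command line, running it, and returning its exit code to pre-commit. The sketch below shows that shape under stated assumptions; it is not the project's own code. The ++no-compile value in the config shown earlier suggests that flags are written with + in place of - so they are not mistaken for the hook's own arguments. That reading is an assumption here, and to_dbt_flags and run_cmd are hypothetical helpers.

```python
import subprocess
import sys
from typing import List, Sequence


def to_dbt_flags(flags: Sequence[str]) -> List[str]:
    """Turn a leading '++' into '--' (assumed convention, see the text above)."""
    return ["--" + f[2:] if f.startswith("++") else f for f in flags]


def run_cmd(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means the hook passes)."""
    completed = subprocess.run(list(cmd), check=False)
    return completed.returncode


# A real wrapper would build something like ["dbt", "parse", *flags].
# The Python interpreter stands in here so the sketch runs without dbt installed.
print(to_dbt_flags(["++no-compile"]))
print(run_cmd([sys.executable, "-c", "raise SystemExit(0)"]))
```
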

References

https://github.com/dbt-checkpoint/dbt-checkpoint
https://pre-commit.com/hooks.html

posted on 2024-04-16 19:39  荣锋亮
