Great Expectations is itself a batch-oriented data quality tool, but by integrating it with a real-time stream processing stack (such as Apache Kafka + Flink) it can perform near-real-time data quality validation. Below is a concrete implementation example that simulates quality monitoring of a supermarket's online order stream:

1. Overall Architecture

Real-time data source (simulated order system) → Kafka (message queue) → Flink (stream processing) → Great Expectations (real-time validation) → result output (console / alerts)

2. Environment Setup

2.1. Install additional dependencies

# On top of an existing Great Expectations installation, add the stream-processing dependencies
pip install kafka-python apache-flink==1.17.0

2.2. Start the required services

  • Kafka: receives the real-time order stream (start Zookeeper first)
    # Start Zookeeper (default port 2181)
    zookeeper-server-start.sh config/zookeeper.properties &
    # Start Kafka (default port 9092)
    kafka-server-start.sh config/server.properties &
    # Create the orders topic
    kafka-topics.sh --create --topic online_orders_stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
    
     

3. Step 1: Create the target table (MySQL, final storage)

-- Real-time order landing table (simulates the online system's final storage)
CREATE TABLE online_orders_realtime (
  order_id INT PRIMARY KEY AUTO_INCREMENT,
  user_id INT NOT NULL,
  product_id VARCHAR(20) NOT NULL,
  quantity INT NOT NULL,
  total_amount DECIMAL(10,2) NOT NULL,
  payment_status ENUM('unpaid', 'paid', 'refunded') NOT NULL,
  order_time DATETIME NOT NULL,
  event_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- time the row was written to the table
);
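The ENUM and DECIMAL constraints above can also be enforced application-side before a row is inserted. A minimal sketch, assuming a hypothetical `row_is_valid` helper (not part of the original code):

```python
from decimal import Decimal, InvalidOperation

ALLOWED_STATUSES = {"unpaid", "paid", "refunded"}  # mirrors the ENUM column

def row_is_valid(row: dict) -> bool:
    """Check a candidate row against the online_orders_realtime constraints."""
    try:
        amount = Decimal(str(row["total_amount"]))
    except (KeyError, InvalidOperation):
        return False
    return (
        isinstance(row.get("user_id"), int)
        and isinstance(row.get("quantity"), int)
        and row.get("payment_status") in ALLOWED_STATUSES
        # DECIMAL(10,2): the amount must already fit in two decimal places
        and amount == amount.quantize(Decimal("0.01"))
    )

print(row_is_valid({"user_id": 1, "quantity": 2,
                    "total_amount": 19.99, "payment_status": "paid"}))  # True
```

Such a pre-insert check is cheaper than a round-trip to MySQL, but it duplicates the schema; the GE suite defined later keeps these rules in one declarative place instead.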

4. Step 2: Generate real-time test data (Kafka producer)

Create kafka_producer.py to simulate the real-time order stream:
Real-time order stream generator script
from kafka import KafkaProducer
import json
import random
from datetime import datetime
import time

# Initialize the Kafka producer (JSON-serialize each message)
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def generate_order():
    """Generate a single order record (with ~10% anomalous data)"""
    order = {
        "order_id": random.randint(10000, 99999),
        "user_id": random.randint(1000, 9999),
        "product_id": f"SP-{random.randint(1000, 9999)}",
        "quantity": random.randint(1, 10),
        "total_amount": round(random.uniform(10, 1000), 2),
        "payment_status": random.choice(['unpaid', 'paid', 'refunded']),
        "order_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }
    
    # Inject anomalous data
    if random.random() < 0.1:
        if random.random() < 0.5:
            order["total_amount"] = -order["total_amount"]  # negative amount
        else:
            order["product_id"] = f"ERR-{random.randint(1, 99)}"  # malformed product ID
    return order

if __name__ == "__main__":
    # Send one record per second to simulate a live stream
    while True:
        order = generate_order()
        producer.send('online_orders_stream', value=order)
        print(f"Sent order: {order['order_id']}")
        time.sleep(1)  # control the send rate
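The ~10% anomaly rate claimed by generate_order can be sanity-checked offline. The sketch below re-implements only the injection logic (no Kafka dependency) and measures the observed fraction over a fixed-seed run:

```python
import random

random.seed(42)  # fixed seed for reproducibility

def make_amount_and_product():
    """Reproduce only the anomaly-injection logic from generate_order."""
    amount = round(random.uniform(10, 1000), 2)
    product = f"SP-{random.randint(1000, 9999)}"
    if random.random() < 0.1:          # ~10% of records become anomalous
        if random.random() < 0.5:
            amount = -amount           # negative amount
        else:
            product = f"ERR-{random.randint(1, 99)}"  # malformed product ID
    return amount, product

records = [make_amount_and_product() for _ in range(10_000)]
bad = sum(1 for amt, pid in records if amt < 0 or not pid.startswith("SP-"))
print(f"anomalous fraction: {bad / len(records):.3f}")
```

Over 10,000 samples the observed fraction should land close to 0.10, confirming the injection branch fires at the intended rate.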

5. Step 3: Validate data quality in real time (Flink + Great Expectations)

5.1. Define the Great Expectations expectation suite (real-time rules)

Create great_expectations/expectations/online_stream_suite.json:
Real-time order stream quality rules
{
  "expectation_suite_name": "online_stream_suite",
  "data_asset_type": "Dataset",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_match_regex",
      "kwargs": {
        "column": "product_id",
        "regex": "^SP-\\d{4}$"
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_greater_than",
      "kwargs": {
        "column": "total_amount",
        "value": 0
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {
        "column": "payment_status",
        "value_set": ["unpaid", "paid", "refunded"]
      }
    }
  ]
}
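The product_id rule can be spot-checked with Python's re module, using the exact pattern the suite declares:

```python
import re

# Same regex as expect_column_values_to_match_regex in online_stream_suite
PRODUCT_ID_RE = re.compile(r"^SP-\d{4}$")

for sample in ["SP-1234", "SP-123", "ERR-12", "sp-1234", "SP-12345"]:
    print(sample, "->", bool(PRODUCT_ID_RE.match(sample)))
```

Only "SP-1234" matches: the pattern is anchored at both ends, requires exactly four digits, and is case-sensitive, so the injected "ERR-…" IDs are guaranteed to fail this expectation.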

5.2. Flink stream processing with real-time validation

Create flink_ge_validator.py to consume the Kafka stream and validate each record:
Flink real-time validation script
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer
import great_expectations as gx
import json
import pandas as pd

# Load the existing Great Expectations project and the suite defined above
context = gx.get_context(project_root_dir=".")
suite = context.get_expectation_suite("online_stream_suite")

def validate_record(record):
    """Validate a single order record"""
    try:
        # Wrap the JSON record in a one-row DataFrame (GE validates tabular batches)
        df = pd.DataFrame([json.loads(record)])
        # Legacy Pandas API; newer GE versions validate through a Validator instead
        ge_df = gx.from_pandas(df)
        results = ge_df.validate(expectation_suite=suite)

        # Report the validation result
        if not results["success"]:
            print(f"❌ Anomalous order: {record}")
            print(f"Error details: {results['results']}")
        else:
            print(f"✅ Order passed validation: {json.loads(record)['order_id']}")
        return str(results["success"])
    except Exception as e:
        print(f"Validation error: {e}")
        return "error"

def main():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Configure the Kafka source (the builder chain must be wrapped in parentheses)
    kafka_source = (
        KafkaSource.builder()
        .set_bootstrap_servers("localhost:9092")
        .set_topics("online_orders_stream")
        .set_group_id("ge-validator-group")
        .set_starting_offsets(KafkaOffsetsInitializer.earliest())
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )

    # Read the stream and validate every record
    ds = env.from_source(kafka_source, WatermarkStrategy.no_watermarks(), "kafka-orders")
    ds.map(validate_record).print()

    # Run the Flink job
    env.execute("Real-time Order Quality Validation")

if __name__ == "__main__":
    main()

6. Step 4: Run and observe the output

6.1. Start the data generator

python kafka_producer.py  # continuously send order data to Kafka

6.2. Start the Flink real-time validator

python flink_ge_validator.py  # consume from Kafka and validate in real time

6.3. Typical output

✅ Order passed validation: 10001
❌ Anomalous order: {"order_id": 10002, "user_id": 2003, "product_id": "ERR-12", "quantity": 2, "total_amount": 199.5, "payment_status": "paid", "order_time": "2024-10-05 15:30:00"}
Error details: [{"expectation_config": {"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "product_id", "regex": "^SP-\\d{4}$"}}, "result": {"success": false, "observed_value": "ERR-12"}}]

❌ Anomalous order: {"order_id": 10003, "user_id": 5001, "product_id": "SP-3456", "quantity": 1, "total_amount": -59.9, "payment_status": "unpaid", "order_time": "2024-10-05 15:30:01"}
Error details: [{"expectation_config": {"expectation_type": "expect_column_values_to_be_greater_than", "kwargs": {"column": "total_amount", "value": 0}}, "result": {"success": false, "observed_value": -59.9}}]

7. Summary

1. How the real-time validation works

  • Kafka receives the real-time data stream, simulating the live output of an online order system.
  • Flink serves as the stream-processing engine, consuming data with millisecond-level latency.
  • Great Expectations validation is embedded in the Flink processing logic, so every record triggers a quality check.

2. Value in the supermarket scenario

  • Real-time interception of anomalies: orders with negative amounts or malformed product IDs raise alerts immediately, preventing them from flowing into downstream OLAP systems and skewing analysis.
  • Fast problem diagnosis: error details are emitted in real time, helping operations staff immediately track down defects in the online order system (e.g. a payment-calculation bug).

3. Limitations and optimizations

  • Performance: per-record validation can hurt throughput; switch to micro-batching (e.g. validate every 100 records at once).
  • Persistence: write validation results to ClickHouse and build a real-time quality dashboard with Grafana.
  • Alerting: when anomalous data is detected, send email / Slack alerts via webhook to close the loop.
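The micro-batching suggestion above can be sketched as a small buffer that flushes every N records and validates the whole batch at once. This is a plain-Python sketch; the `validate_batch` callback is a stand-in for a batch-level GE validation call:

```python
class MicroBatchValidator:
    """Buffer incoming records and validate them in batches of `size`."""

    def __init__(self, size, validate_batch):
        self.size = size
        self.validate_batch = validate_batch  # callback: list[dict] -> list[bool]
        self.buffer = []
        self.failed = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        results = self.validate_batch(self.buffer)
        # collect records that failed validation, then reset the buffer
        self.failed.extend(r for r, ok in zip(self.buffer, results) if not ok)
        self.buffer.clear()

# Example: batch-check that total_amount > 0 for each record
checker = MicroBatchValidator(3, lambda batch: [r["total_amount"] > 0 for r in batch])
for amt in (10.0, -5.0, 20.0, 30.0):
    checker.add({"total_amount": amt})
checker.flush()  # flush the remaining partial batch
print(len(checker.failed))  # the -5.0 record was caught
```

Amortizing one validation call over a whole buffer trades a little detection latency for much better throughput, which is usually the right trade for per-record checks like these.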

In this way, Great Expectations can move beyond its offline roots and act as a quality gatekeeper inside real-time data pipelines, which is especially valuable in scenarios such as supermarkets where transactional data accuracy is critical.
posted on 2025-08-13 15:07  xibuhaohao