Great Expectations is fundamentally a batch-oriented data quality tool, but by integrating it with a real-time stream processing stack (such as Apache Kafka + Flink) you can achieve near-real-time data quality validation. The walkthrough below simulates quality monitoring of a supermarket's online order stream:
1. Overall Architecture
Real-time data source (simulated order system) → Kafka (message queue) → Flink (stream processing) → Great Expectations (real-time validation) → Result output (console/alerts)
2. Environment Setup
2.1. Install Additional Dependencies
# On top of the existing Great Expectations installation, add the stream-processing dependencies
pip install kafka-python apache-flink==1.17.0
2.2. Start the Required Services
- Kafka: receives the real-time order stream (start Zookeeper first)
# Start Zookeeper (default port 2181)
zookeeper-server-start.sh config/zookeeper.properties &
# Start Kafka (default port 9092)
kafka-server-start.sh config/server.properties &
# Create the order topic
kafka-topics.sh --create --topic online_orders_stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
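Before moving on, you can optionally confirm the topic was created from Python. This is a small sketch using kafka-python (installed above); the check_topic.py filename is just illustrative:

# check_topic.py - optional sanity check that the topic exists
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# list_topics() returns the topic names known to the broker
assert "online_orders_stream" in admin.list_topics(), "topic missing"
print("online_orders_stream is ready")
admin.close()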
3. Step 1: Create the Test Table (MySQL, for Final Storage)
-- Real-time order landing table (simulates the online system's final storage)
CREATE TABLE online_orders_realtime (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    product_id VARCHAR(20) NOT NULL,
    quantity INT NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    payment_status ENUM('unpaid', 'paid', 'refunded') NOT NULL,
    order_time DATETIME NOT NULL,
    event_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- time the row was written
);
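The scripts in this post never actually write to this table; as one possible sink, a validated order could be landed with PyMySQL. This is a minimal sketch: the connection parameters and the insert_order helper are placeholders for illustration, not part of the original pipeline.

# mysql_sink.py - hypothetical sink that lands a validated order in MySQL
import pymysql

def insert_order(order: dict):
    # Placeholder credentials/database name; adjust to your environment
    conn = pymysql.connect(host="localhost", user="root", password="***", database="mall")
    try:
        with conn.cursor() as cur:
            # order_id is AUTO_INCREMENT, so it is omitted here
            cur.execute(
                """INSERT INTO online_orders_realtime
                   (user_id, product_id, quantity, total_amount, payment_status, order_time)
                   VALUES (%s, %s, %s, %s, %s, %s)""",
                (order["user_id"], order["product_id"], order["quantity"],
                 order["total_amount"], order["payment_status"], order["order_time"]),
            )
        conn.commit()
    finally:
        conn.close()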
4. Step 2: Generate Real-Time Test Data (Kafka Producer)
Create kafka_producer.py to simulate the real-time order stream:
from kafka import KafkaProducer
import json
from faker import Faker
import random
from datetime import datetime
import time

# Initialize Faker (reserved for richer fields) and the Kafka producer
fake = Faker('zh_CN')
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def generate_order():
    """Generate a single order record (roughly 10% anomalous data)."""
    order = {
        "order_id": random.randint(10000, 99999),
        "user_id": random.randint(1000, 9999),
        "product_id": f"SP-{random.randint(1000, 9999)}",
        "quantity": random.randint(1, 10),
        "total_amount": round(random.uniform(10, 1000), 2),
        "payment_status": random.choice(['unpaid', 'paid', 'refunded']),
        "order_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }
    # Inject anomalies
    if random.random() < 0.1:
        if random.random() < 0.5:
            order["total_amount"] = -order["total_amount"]  # negative amount
        else:
            order["product_id"] = f"ERR-{random.randint(1, 99)}"  # malformed product ID
    return order

if __name__ == "__main__":
    # Continuously send data (one record per second) to simulate a live stream
    while True:
        order = generate_order()
        producer.send('online_orders_stream', value=order)
        print(f"Sent order: {order['order_id']}")
        time.sleep(1)  # throttle the send rate
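To confirm the producer is actually emitting messages, a throwaway consumer can tail the topic. This is a usage sketch with kafka-python; peek_orders.py is an illustrative name:

# peek_orders.py - tail the topic to verify the producer works
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    "online_orders_stream",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    print(msg.value)  # each value is one order dict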
5. Step 3: Validate Data Quality in Real Time (Flink + Great Expectations)
5.1. Define the Great Expectations Expectation Suite (Real-Time Rules)
Create great_expectations/expectations/online_stream_suite.json:
{ "expectation_suite_name": "online_stream_suite", "data_asset_type": "Dataset", "expectations": [ { "expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "order_id"} }, { "expectation_type": "expect_column_values_to_match_regex", "kwargs": { "column": "product_id", "regex": "^SP-\\d{4}$" } }, { "expectation_type": "expect_column_values_to_be_greater_than", "kwargs": { "column": "total_amount", "value": 0 } }, { "expectation_type": "expect_column_values_to_be_in_set", "kwargs": { "column": "payment_status", "value_set": ["unpaid", "paid", "refunded"] } } ] }
5.2. Flink Stream Processing with Real-Time Validation
Create flink_ge_validator.py to consume the Kafka stream and validate each record:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common import WatermarkStrategy
from great_expectations.data_context import FileDataContext
import great_expectations as ge
import json
import pandas as pd

# Initialize Great Expectations (loads the project in the current directory)
context = FileDataContext.create(project_root_dir=".")
suite = context.get_expectation_suite("online_stream_suite")

def validate_record(record):
    """Validate a single order record against the expectation suite."""
    try:
        # Wrap the JSON record in a one-row DataFrame (GE validates tabular batches)
        df = ge.from_pandas(pd.DataFrame([json.loads(record)]))
        # Run the suite against this one-row batch
        results = df.validate(expectation_suite=suite)
        if not results["success"]:
            print(f"❌ Anomalous order: {record}")
            print(f"Error details: {results['results']}")
        else:
            print(f"✅ Order passed validation: {json.loads(record)['order_id']}")
        return f"success={results['success']}"
    except Exception as e:
        print(f"Validation failed: {e}")
        return f"error: {e}"

def main():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)
    # The Kafka connector JAR must be on the classpath, e.g.:
    # env.add_jars("file:///path/to/flink-sql-connector-kafka-1.17.0.jar")

    # Configure the Kafka source
    kafka_source = KafkaSource.builder() \
        .set_bootstrap_servers("localhost:9092") \
        .set_topics("online_orders_stream") \
        .set_group_id("ge-validator-group") \
        .set_starting_offsets(KafkaOffsetsInitializer.earliest()) \
        .set_value_only_deserializer(SimpleStringSchema()) \
        .build()

    # Read the stream and validate every record
    ds = env.from_source(kafka_source, WatermarkStrategy.no_watermarks(), "kafka-source")
    ds.map(validate_record).print()

    # Execute the Flink job
    env.execute("Real-time Order Quality Validation")

if __name__ == "__main__":
    main()
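Before launching the full Flink job, it can help to smoke-test validate_record on a single hand-crafted record. This is a usage sketch; run it in the same module or a REPL where validate_record is defined:

# Quick local test of the validation function (no Flink needed)
sample = {
    "order_id": 10002, "user_id": 2003, "product_id": "ERR-12",
    "quantity": 2, "total_amount": 199.5,
    "payment_status": "paid", "order_time": "2024-10-05 15:30:00",
}
validate_record(json.dumps(sample))  # should flag the malformed product_id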
6. Step 4: Run and Observe the Output
6.1. Start the Data Generator
python kafka_producer.py  # continuously sends order data to Kafka
6.2. Start the Flink Real-Time Validator
python flink_ge_validator.py  # consumes from Kafka and validates in real time
6.3. Typical Output
✅ Order passed validation: 10001
❌ Anomalous order: {"order_id": 10002, "user_id": 2003, "product_id": "ERR-12", "quantity": 2, "total_amount": 199.5, "payment_status": "paid", "order_time": "2024-10-05 15:30:00"}
Error details: [{"expectation_config": {"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "product_id", "regex": "^SP-\\d{4}$"}}, "result": {"success": false, "observed_value": "ERR-12"}}]
❌ Anomalous order: {"order_id": 10003, "user_id": 5001, "product_id": "SP-3456", "quantity": 1, "total_amount": -59.9, "payment_status": "unpaid", "order_time": "2024-10-05 15:30:01"}
Error details: [{"expectation_config": {"expectation_type": "expect_column_values_to_be_between", "kwargs": {"column": "total_amount", "min_value": 0, "strict_min": true}}, "result": {"success": false, "observed_value": -59.9}}]
7. Summary
1. How the real-time validation works
- Kafka receives the real-time data stream, simulating the live output of an online order system.
- Flink serves as the stream-processing engine, consuming data with millisecond-level latency.
- Great Expectations validation is embedded in the Flink processing logic, so every record triggers a quality check.
2. Value in the supermarket scenario
- Intercept anomalies in real time: orders with negative amounts or malformed product IDs trigger immediate alerts instead of flowing into downstream OLAP systems and skewing analysis.
- Locate problems quickly: error details are emitted in real time, helping operations staff immediately pinpoint defects in the online order system (such as a faulty payment-calculation rule).
3. Limitations and optimizations
- Performance: per-record validation can limit throughput; switch to micro-batching (e.g., validate every 100 records), as sketched after this list.
- Persistent storage: write validation results to ClickHouse and build a real-time quality dashboard with Grafana.
- Alert integration: when anomalous data is detected, send email/Slack alerts via webhook to close the loop.
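To illustrate the micro-batching idea from the first point, here is a rough sketch; the 100-record batch size and the buffering scheme are assumptions, not part of the original pipeline:

# micro_batch_validator.py - hypothetical micro-batch variant of validate_record
import json
import pandas as pd
import great_expectations as ge

BATCH_SIZE = 100  # assumed batch size; tune for your throughput
_buffer = []

def validate_micro_batch(record, suite):
    """Buffer records and validate them as one DataFrame once the buffer is full."""
    _buffer.append(json.loads(record))
    if len(_buffer) < BATCH_SIZE:
        return None  # not enough records yet
    df = ge.from_pandas(pd.DataFrame(_buffer))
    _buffer.clear()
    results = df.validate(expectation_suite=suite)
    if not results["success"]:
        failed = [r for r in results["results"] if not r["success"]]
        print(f"❌ {len(failed)} expectation(s) failed in this batch")
    return results["success"]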
With this approach, Great Expectations moves beyond its offline roots and serves as a quality gate inside real-time data pipelines, which makes it especially suitable for scenarios like supermarket retail where transaction-data accuracy is critical.