Using Benthos to stream load data into Doris
Below is the YAML configuration, containing only the input and output sections; a custom data-transformation pipeline can be inserted between them (a hypothetical sketch of such a stage follows).
The data consumed from Kafka is already JSON, so no transformation is needed; the output side uses the http_client component, with batching configured to improve throughput.
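A minimal sketch of what that optional middle stage could look like if a transformation were ever needed; the Bloblang mapping and the load_time field are hypothetical, since the live setup forwards the Kafka JSON unchanged:

pipeline:
  processors:
    - bloblang: |
        # Hypothetical mapping: pass the document through and stamp a load time.
        root = this
        root.load_time = now()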
Problems encountered:
1. publish timeout errors on the backend: commits were too frequent and BE compaction could not keep up. Fix: enlarge the batch so that fewer transactions are committed (see the excerpt after this list).
2. label already exist: the label that should be unique per transaction was duplicated, possibly because one poll consumed enough data to fill an HTTP batch at the same instant as another, so the labels came out identical. Fix: set max_in_flight: 1.
Load testing so far: with two BE nodes and one Benthos instance, consuming about 27,000 messages/s from the Kafka cluster produced no duplicate data and no data loss.
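Both fixes come down to two settings on the http_client output; the relevant knobs, excerpted and simplified from the full configuration below:

output:
  http_client:
    max_in_flight: 1     # serialize requests so two batches cannot be built at the same instant and share a label
    batching:
      count: 10000       # fewer, larger transactions ease publish/compaction pressure on the BEs
      period: 5s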
input:
  broker:
    copies: 9
    inputs:
      - kafka:
          addresses:
            - 222.222.222.5:9092
          topics:
            - test_stream_load
          consumer_group: abc_live
          target_version: 1.1.0
          checkpoint_limit: 10000
          batching:
            count: 10000
            period: 5s
            processors:
              - log:
                  level: DEBUG
                  message: "kafka read: *****${! content().string()}*********"
output:
  broker:
    copies: 1
    pattern: round_robin
    outputs:
      - http_client:
          url: http://222.222.222.5:8030/api/db/table/_stream_load
          verb: PUT
          headers:
            Content-Type: application/json
            #Connection: keep-alive
            Expect: 100-continue
            Authorization: Basic ******************
            format: json
            read_json_by_line: true
          rate_limit: "" # No default (optional)
          timeout: 10s
          max_in_flight: 1
          batching:
            ## Optimization: committing too frequently causes BE publish timeouts, so commit large batches instead.
            count: 10000
            byte_size: 0
            period: 5s
            check: ""
            processors:
              - archive:
                  format: lines
              ## Optimization: give each batch a unique label, plus a retry mechanism, to achieve exactly-once
              ## (the label header and retry fields are sketched after this config).
              - bloblang: meta stream_label = hostname() + uuid_v4()
              - log:
                  level: DEBUG
                  message: ${! meta("stream_label")}
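The config above generates a stream_label per batch but, as pasted, does not show the label being sent to Doris or the retry settings mentioned in the comment. A sketch, assuming the label is passed as the Stream Load label header and using the http_client output's built-in retry fields (the retry values are illustrative, not the ones from the test):

      - http_client:
          url: http://222.222.222.5:8030/api/db/table/_stream_load
          verb: PUT
          headers:
            # Doris rejects a second transaction with the same label, so a retried
            # batch that reuses its label cannot be loaded twice.
            label: ${! meta("stream_label")}
          retries: 3          # illustrative: retry attempts per request
          retry_period: 1s    # illustrative: wait between attempts
          max_in_flight: 1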
