Using Benthos to stream load data into Doris
Below is the YAML configuration, containing only the input and output sections; a custom data-transformation pipeline can be inserted between them (a hypothetical sketch of such a stage follows).
The data consumed from Kafka is already JSON, so no transformation is needed; the output side uses the http_client component, with batching configured to improve throughput.
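A minimal sketch of what that optional middle stage could look like if a transformation were ever needed; the Bloblang mapping and the load_time field are hypothetical, since the live setup forwards the Kafka JSON unchanged:

pipeline:
  processors:
    - bloblang: |
        # Hypothetical mapping: pass the document through and stamp a load time.
        root = this
        root.load_time = now()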
Problems encountered:
1. publish timeout errors on the backend: commits were too frequent and BE compaction could not keep up. Fix: enlarge the batch so that fewer transactions are committed (see the excerpt after this list).
2. label already exist: the label that should be unique per transaction was duplicated, possibly because one poll consumed enough data to fill an HTTP batch at the same instant as another, so the labels came out identical. Fix: set max_in_flight: 1.
Load testing so far: with two BE nodes and one Benthos instance, consuming about 27,000 messages/s from the Kafka cluster produced no duplicate data and no data loss.
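Both fixes come down to two settings on the http_client output; the relevant knobs, excerpted and simplified from the full configuration below:

output:
  http_client:
    max_in_flight: 1     # serialize requests so two batches cannot be built at the same instant and share a label
    batching:
      count: 10000       # fewer, larger transactions ease publish/compaction pressure on the BEs
      period: 5s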
input:
  broker:
    copies: 9
    inputs:
      - kafka:
          addresses:
            - 222.222.222.5:9092
          topics:
            - test_stream_load
          consumer_group: abc_live
          target_version: 1.1.0
          checkpoint_limit: 10000
          batching:
            count: 10000
            period: 5s
            processors:
              - log:
                  level: DEBUG
                  message: "kafka read: *****${! content().string()}*********"
output:
  broker:
    copies: 1
    pattern: round_robin
    outputs:
      - http_client:
          url: http://222.222.222.5:8030/api/db/table/_stream_load
          verb: PUT
          headers:
            Content-Type: application/json
            #Connection: keep-alive
            Expect: 100-continue
            Authorization: Basic ******************
            format: json
            read_json_by_line: true
          rate_limit: "" # No default (optional)
          timeout: 10s
          max_in_flight: 1
          batching:
            ## Optimization: committing too frequently causes BE publish timeouts, so commit large batches instead.
            count: 10000
            byte_size: 0
            period: 5s
            check: ""
            processors:
              - archive:
                  format: lines
              ## Optimization: give each batch a unique label, plus a retry mechanism, to achieve exactly-once
              ## (the label header and retry fields are sketched after this config).
              - bloblang: meta stream_label = hostname() + uuid_v4()
              - log:
                  level: DEBUG
                  message: ${! meta("stream_label")}
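The config above generates a stream_label per batch but, as pasted, does not show the label being sent to Doris or the retry settings mentioned in the comment. A sketch, assuming the label is passed as the Stream Load label header and using the http_client output's built-in retry fields (the retry values are illustrative, not the ones from the test):

      - http_client:
          url: http://222.222.222.5:8030/api/db/table/_stream_load
          verb: PUT
          headers:
            # Doris rejects a second transaction with the same label, so a retried
            # batch that reuses its label cannot be loaded twice.
            label: ${! meta("stream_label")}
          retries: 3          # illustrative: retry attempts per request
          retry_period: 1s    # illustrative: wait between attempts
          max_in_flight: 1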
