Importing Hive table data with Waterdrop: check your version first. If your Spark is older than 2.3, download the 1.5 release of Waterdrop that bundles its own Spark.

Copy batch.conf.template to batch.conf and edit it (the configuration below is my example; adjust it to your own needs).
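For example, assuming you are in Waterdrop's config directory (the exact path depends on where you unpacked the release):

cp batch.conf.template batch.conf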

######
###### This config file is a demonstration of batch processing in waterdrop config
######

spark {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

input {
  # This is an example input plugin **only for testing and demonstrating the input plugin feature**
  hive {
    pre_sql = "select * from terminal.XX"
    result_table_name = "XX"
  }



  # You can also use other input plugins, such as hdfs
  # hdfs {
  #   result_table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

filter {
#  # split data by specific delimiter
#  split {
#    fields = ["msg", "name"]
#    delimiter = " "
#    result_table_name = "accesslog"
#    remove {
#        source_field = ["imei1", "imei2"]
#    }
#  }



  # you can also use other filter plugins, such as sql
  # sql {
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

output {
  # choose stdout output plugin to output data to console
  #stdout {
  #}

  clickhouse {
    host = "127.0.0.1:8123"
    database = "waterdrop"
    table = "access_log"
    fields = ["XX","day"]
    username = "user_richdm"
    password = "richdm"
  }

  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}


Run the job:

./start-waterdrop.sh --master yarn --deploy-mode client --config ../config/batch.conf


The ClickHouse database and table must be created in advance; Waterdrop will not create them for you automatically.
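For reference, a minimal sketch of what that pre-created table could look like. The column names simply mirror the fields list in the output section (XX is a placeholder for your real columns), and the MergeTree engine, partitioning, and ordering key here are assumptions; use whatever schema your data actually needs:

CREATE DATABASE IF NOT EXISTS waterdrop;

CREATE TABLE IF NOT EXISTS waterdrop.access_log
(
    XX  String,   -- placeholder column, mirrors the fields list above
    day Date      -- date column, used for partitioning in this sketch
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY day;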