Vocational Big Data Competition: Data Collection
Flume Data Collection
Task
On the master node, use Flume to collect the socket data that the real-time data generator emits on port 10050 and write it into a Kafka topic (topic name order, 4 partitions). Then use Kafka's built-in consumer to consume the order topic, and paste a screenshot of the first 2 records into 【Release\任务D提交结果.docx】 on the client desktop under the corresponding task number.
Walkthrough
From the requirements, the source is netcat (a socket listener on port 10050) and the sink is Kafka.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 10050
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = order
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
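Before wiring in Kafka, the netcat source can be smoke-tested without the real generator. The sketch below is a hypothetical Python stand-in (not part of the competition environment): a stub listener plays the role of the Flume netcat source, and a client sends newline-terminated records, which is the wire format the netcat source expects on port 10050.

```python
import socket
import threading

# Stub listener standing in for the Flume netcat source; the real agent
# listens on port 10050. Port 0 lets the OS pick a free port for the demo.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

received = []

def listen():
    conn, _ = srv.accept()
    data = b""
    while (chunk := conn.recv(1024)):
        data += chunk
    received.extend(data.decode().splitlines())
    conn.close()

t = threading.Thread(target=listen)
t.start()

# Client side: what the data generator does -- newline-terminated records.
cli = socket.create_connection(("127.0.0.1", port))
for rec in ["product_browse:(9850|11088|0|0|'20231202120429');",
            "customer_point_log:(2257|0|2023120985853765|549|'20231205114929');"]:
    cli.sendall((rec + "\n").encode())
cli.close()
t.join()
srv.close()

print(received)
```

Each line arriving on the socket becomes one Flume event, which is why the generator terminates every record with a newline.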
Create the topic as required (name order, 4 partitions):
[root@bigdata1 ~]# kafka-topics.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --create --topic order --partitions 4
The generator script needs the Flume listening port to be open before it can send, so start Flume first:
[root@bigdata1 ds]# flume-ng agent -f ./socket_to_kafka.conf -n a1 -Dflume.root.logger=INFO,console
Start the generator script:
[root@bigdata1 ~]# /data_log/gen_ds_data_to_socket
Use Kafka's built-in console consumer to subscribe to order and verify the task requirement is met:
[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic order --from-beginning
customer_point_log:(2257|0|2023120985853765|549|'20231205114929');
product_browse:(9850|11088|0|0|'20231202120429');
product_browse:(6124|11088|0|0|'20231202113229');
customer_point_log:(11088|0|2023120985756492|510|'20231205185729');
product_browse:(4577|2257|1|2023120985853765|'20231202034429');
If records stream in continuously, the generator and pipeline are working. Per the task, consume from the beginning and stop after the first two records:
[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic order --from-beginning --max-messages 2
product_browse:(9820|3593|0|0|'20231202071048');
product_browse:(3955|3540|0|0|'20231202052549');
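The consumed records follow a simple `table:(field|field|…);` pattern. A small hypothetical parser for downstream checks (the field meanings are guesses from the samples above, not documented by the generator):

```python
import re

def parse_record(line):
    """Split a generator record like
    "product_browse:(9820|3593|0|0|'20231202071048');"
    into (table, fields); return None if the line doesn't match."""
    m = re.fullmatch(r"(\w+):\((.*)\);", line.strip())
    if not m:
        return None
    table, body = m.group(1), m.group(2)
    # Fields are pipe-separated; timestamps arrive single-quoted.
    fields = [f.strip("'") for f in body.split("|")]
    return table, fields

table, fields = parse_record("product_browse:(9820|3593|0|0|'20231202071048');")
print(table, fields)
```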
02
Task
Using fan-out mode (what the task statement calls multiplexing), have Flume write the incoming data to Kafka while also backing it up to the HDFS directory /user/test/flumebackup. Paste a screenshot of the command and result for viewing the first 2 records of the first file in the backup directory into 【Release\任务D提交结果.docx】 on the client desktop under the corresponding task number.
Walkthrough
Per the task, copy the conf file from the previous step, add an HDFS sink with fileType set to DataStream, and bind the source to both channels so each sink receives a copy of every event.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 25001
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = ods_mall
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://bigdata1:9000/user/test/flumebackup/
a1.sinks.k2.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k2.channel = c2
a1.sinks.k1.channel = c1
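A fan-out config silently loses data when a sink is bound to a channel that no source feeds: the sink simply never sees an event. The helper below is a hypothetical sketch that parses a Flume properties snippet and flags that mistake:

```python
def check_bindings(conf_text, agent="a1"):
    """Flag any sink whose channel is missing from the source's channel list."""
    props = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, val = line.partition("=")
        props[key.strip()] = val.strip()

    # Collect every channel some source writes into.
    src_channels = set()
    for key, val in props.items():
        if key.startswith(f"{agent}.sources.") and key.endswith(".channels"):
            src_channels.update(val.split())

    # Any sink reading from a channel outside that set is starved.
    problems = []
    for key, val in props.items():
        if key.startswith(f"{agent}.sinks.") and key.endswith(".channel"):
            if val not in src_channels:
                problems.append(f"{key} -> {val} (not fed by any source)")
    return problems

conf = """
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
"""
print(check_bindings(conf))
```

Running it on a config where the source is bound only to c1 reports the HDFS sink on c2 as starved, which is exactly the symptom of forgetting `c1 c2` on the source.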
Finally, view the first two records of the first file in the backup directory:
[root@bigdata1 conf]# hdfs dfs -cat /user/test/flumebackup/FlumeData.1702108274881 | head -n 2
If `hdfs.fileType` is left unset or its key is misspelled (e.g. `filetype = DataStrem`), the sink falls back to its default SequenceFile format and the backup file opens with a binary header like this:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable…product_browse:(4354|16911|1|2023120912081526|'20231202105714');…
The surrounding mojibake is SequenceFile binary framing, not a missing Chinese locale; with `hdfs.fileType = DataStream` set correctly, the events are written as plain text.
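One quick way to tell the two formats apart without opening the file in an editor: Hadoop SequenceFiles start with the 3-byte magic `SEQ`. A minimal hypothetical classifier over the file's leading bytes:

```python
def hdfs_file_kind(first_bytes):
    """Classify a Flume HDFS-sink output file by its leading bytes.
    SequenceFiles (the sink's default when hdfs.fileType is unset or
    misspelled) begin with the magic header b"SEQ"; DataStream output
    is the raw event text."""
    if first_bytes.startswith(b"SEQ"):
        return "SequenceFile (binary wrapper around each event)"
    return "DataStream (plain text)"

print(hdfs_file_kind(b"SEQ\x06!org.apache.hadoop.io.LongWritable"))
print(hdfs_file_kind(b"product_browse:(9850|11088|0|0|'20231202120429');\n"))
```

In practice, `hdfs dfs -cat <file> | head -c 3` gives you the same answer from the shell.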
Collecting Data with Maxwell
Edit the Maxwell config file:
[root@bigdata1 data]# vi /opt/module/maxwell/config.properties
# tl;dr config
log_level=info
# Where Maxwell sends data; options: stdout|file|kafka|kinesis|pubsub|sqs|rabbitmq|redis
producer=kafka
# Target Kafka cluster address
kafka.bootstrap.servers=bigdata1:9092,bigdata2:9092,bigdata3:9092
# Target Kafka topic; static (e.g. maxwell) or dynamic (e.g. %{database}_%{table})
kafka_topic=maxwell
# MySQL connection settings
host=bigdata1
user=root
password=123456
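Maxwell publishes each binlog change as a JSON document: database, table, change type, and timestamp, with the row image under `data`. A sketch of consuming one such message; the database and table names below are made up for illustration, not taken from the competition environment.

```python
import json

# Illustrative Maxwell event. The field layout (database/table/type/ts
# plus the row image under "data") follows Maxwell's JSON output format.
msg = '''{"database": "shop", "table": "order_info", "type": "insert",
          "ts": 1701757049, "data": {"id": 42, "amount": "99.00"}}'''

event = json.loads(msg)
print(event["database"], event["table"], event["type"], event["data"]["id"])
```

A consumer on the maxwell topic would apply `json.loads` to each message value in the same way.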
Start Maxwell in the background:
[root@bigdata1 maxwell]# ./bin/maxwell --config config.properties --daemon
Redirecting STDOUT and STDERR to ./logs/MaxwellDaemon.out
Using kafka version: 1.0.0
Start a Kafka consumer and wait for the data to arrive:
[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092 --from-beginning --topic maxwell
Set up the Flume listener config (the real-time data script requires the listening port to be open before it can send):
[root@bigdata1 maxwell-1.29.0]# vi /opt/module/flume-1.9.0/conf/diy.conf
# Define the agent
agent1.sources = source1
agent1.channels = channel1 channel2
agent1.sinks = kafka-sink hdfs-sink
# Configure source
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = 0.0.0.0
agent1.sources.source1.port = 25001
# Configure channels: one per sink, so the default replicating selector
# gives every sink its own copy of each event
agent1.channels.channel1.type = memory
agent1.channels.channel2.type = memory
# Configure sinks
# Kafka sink
agent1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.kafka.topic = order2
agent1.sinks.kafka-sink.kafka.bootstrap.servers = bigdata1:9092
# HDFS sink
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://bigdata1:9000/user/test/
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
# Bind the source and sinks to the channels
agent1.sources.source1.channels = channel1 channel2
agent1.sinks.kafka-sink.channel = channel1
agent1.sinks.hdfs-sink.channel = channel2
Start Flume:
bin/flume-ng agent -f conf/diy.conf -n agent1 -Dflume.root.logger=INFO,console
Write data into MySQL; Maxwell will pick the changes up from the binlog and publish them to the maxwell topic.
