Vocational College Big Data Competition: Data Collection

Flume Data Collection

Task

On the master node, use Flume to collect the socket data that the real-time data generator emits on port 10050 and write it into a Kafka topic (topic name order, 4 partitions). Use Kafka's built-in consumer to consume the data in the order topic, and paste a screenshot of the first 2 records into the corresponding task number in 【Release\任务D提交结果.docx】 on the client desktop.

Solution

From the analysis, the source is netcat and the sink is Kafka. Save the following as socket_to_kafka.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 10050

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = order
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
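The netcat source simply listens on a TCP port and turns each newline-terminated line into one Flume event. The interaction between the data generator and the source can be sketched in Python (the server below is a stand-in for Flume's netcat source, not Flume itself; the port and record format come from the task):

```python
import socket
import threading

HOST, PORT = "localhost", 10050  # same port the netcat source binds
received = []
ready = threading.Event()

def fake_netcat_source():
    # Stand-in for Flume's netcat source: accept one connection and
    # read newline-delimited lines, one event per line.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn, conn.makefile("r") as f:
            for line in f:
                received.append(line.rstrip("\n"))

t = threading.Thread(target=fake_netcat_source)
t.start()
ready.wait()

# Stand-in for the data generator: connect and send one sample record.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"product_browse:(9850|11088|0|0|'20231202120429');\n")

t.join()
print(received)  # each line becomes one event
```

This is why the generator must be started after the agent: if nothing is listening on port 10050, the generator's connection is refused.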

Create the topic as required:

[root@bigdata1 ~]# kafka-topics.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --create --topic order --partitions 4

The generator script needs Flume's listening port to be open before it can send, so start Flume first:

[root@bigdata1 ds]# flume-ng agent -f ./socket_to_kafka.conf -n a1 -Dflume.root.logger=INFO,console

Start the generator script:

[root@bigdata1 ~]# /data_log/gen_ds_data_to_socket 

Use Kafka's built-in consumer to subscribe to order and verify the result meets the task requirements:

[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic order --from-beginning

customer_point_log:(2257|0|2023120985853765|549|'20231205114929');
product_browse:(9850|11088|0|0|'20231202120429');
product_browse:(6124|11088|0|0|'20231202113229');
customer_point_log:(11088|0|2023120985756492|510|'20231205185729');
product_browse:(4577|2257|1|2023120985853765|'20231202034429');

If a large volume of data is coming through, the generator script is working correctly. As the task requires, view the first two records:

[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic order --max-messages 2

product_browse:(9820|3593|0|0|'20231202071048');
product_browse:(3955|3540|0|0|'20231202052549');
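The generator's records follow a fixed pattern, table_name:(field|field|…);. A small parser makes them easy to inspect (a sketch; the field semantics are not defined by the task, so they are left as positional strings):

```python
import re

# Pattern observed in the consumer output: table_name:(v1|v2|...|'timestamp');
RECORD_RE = re.compile(r"^(\w+):\((.*)\);$")

def parse_record(line):
    m = RECORD_RE.match(line.strip())
    if not m:
        raise ValueError(f"unrecognized record: {line!r}")
    table, body = m.groups()
    # Fields are pipe-separated; the timestamp is single-quoted.
    fields = [f.strip("'") for f in body.split("|")]
    return table, fields

table, fields = parse_record("product_browse:(9820|3593|0|0|'20231202071048');")
print(table, fields)
```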

02

Task

Using multiplexing (fan-out) mode, have Flume back the data up to the HDFS directory /user/test/flumebackup while injecting it into Kafka. Paste a screenshot of the command and result for viewing the first 2 records of the first file in the backup directory into the corresponding task number in 【Release\任务D提交结果.docx】 on the client desktop.

Solution

As the task asks, copy the conf file from the previous step, add an HDFS sink, and use the DataStream fileType:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 25001

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = ods_mall
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092

a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://bigdata1:9000/user/test/flumebackup/
a1.sinks.k2.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sinks to the channels
# The source must feed both channels so that k2 (HDFS) also receives events;
# the default replicating selector copies each event into every listed channel
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
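With the replicating selector (Flume's default), every event from the source is copied into each channel it is bound to, so the Kafka sink and the HDFS sink each see the full stream. The fan-out can be modeled with two queues standing in for the channels:

```python
from queue import Queue

# Model of Flume's replicating channel selector: the source puts a
# copy of every event into each bound channel.
c1, c2 = Queue(), Queue()  # stand-ins for channels c1 and c2

def replicate(event, channels):
    for ch in channels:
        ch.put(event)

events = ["e1", "e2", "e3"]
for e in events:
    replicate(e, [c1, c2])

kafka_sink = [c1.get() for _ in events]  # drained by sink k1
hdfs_sink = [c2.get() for _ in events]   # drained by sink k2
print(kafka_sink, hdfs_sink)  # both sinks see the full stream
```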

Finally, view the first two records of the first file:

[root@bigdata1 conf]# hdfs dfs -cat /user/test/flumebackup/FlumeData.1702108274881 | head -n 2
2023-12-09 15:57:37,950 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable [binary bytes omitted] product_browse:(4354|16911|1|2023120912081526|'20231202105714'); [binary bytes omitted] product_browse:(6637|9080|0|0|'20231202080117'); [further records omitted]

The binary SEQ…LongWritable…BytesWritable prefix shows that the sink wrote a Hadoop SequenceFile, which is Flume's default output format when hdfs.fileType is not set correctly. The key is case-sensitive (fileType) and the value must be spelled exactly DataStream; a typo such as filetype = DataStrem silently falls back to the SequenceFile default. Mojibake from the Chinese text in the records can be resolved by installing Chinese locale/encoding support on Linux.
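The SEQ prefix in the dump is the Hadoop SequenceFile magic header, so checking the first bytes of a backup file is a quick way to tell whether the sink wrote a SequenceFile instead of a plain stream (a minimal sketch; the sample header bytes mirror the dump above):

```python
# Hadoop SequenceFiles begin with the 3-byte magic "SEQ" followed by a
# one-byte version number, then the key and value class names.
def is_sequence_file(first_bytes: bytes) -> bool:
    return first_bytes[:3] == b"SEQ"

# First bytes of the backup file dumped above vs. a plain DataStream file.
seq_header = b"SEQ\x06!org.apache.hadoop.io.LongWritable"
plain_data = b"product_browse:(4354|16911|1|..."

print(is_sequence_file(seq_header))  # SequenceFile: fileType was wrong
print(is_sequence_file(plain_data))  # DataStream: fileType = DataStream
```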

Maxwell Data Collection

Edit the Maxwell configuration file:

[root@bigdata1 data]# vi /opt/module/maxwell/config.properties
# tl;dr config
log_level=info
# Where Maxwell sends its data; options: stdout|file|kafka|kinesis|pubsub|sqs|rabbitmq|redis
producer=kafka
# Target Kafka cluster addresses
kafka.bootstrap.servers=bigdata1:9092,bigdata2:9092,bigdata3:9092
# Target Kafka topic; can be static, e.g. maxwell, or dynamic, e.g. %{database}_%{table}
kafka_topic=maxwell
# MySQL connection settings
host=bigdata1
user=root
password=123456
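Maxwell publishes each binlog change to Kafka as a JSON envelope with fields such as database, table, type, ts, and data (the changed row's values). Parsing one message (the payload below is illustrative, including the database and table names, and is not taken from the task):

```python
import json

# Illustrative Maxwell message; the envelope fields (database, table,
# type, ts, data) follow Maxwell's standard JSON output format.
msg = '''{"database": "ds_db", "table": "order_info",
          "type": "insert", "ts": 1702108274,
          "data": {"id": 1, "amount": 549}}'''

event = json.loads(msg)
print(event["type"], event["table"], event["data"])
```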

Start Maxwell in the background:

[root@bigdata1 maxwell]# ./bin/maxwell --config config.properties --daemon
Redirecting STDOUT to /opt/module/maxwell/bin/ ./logs/MaxwellDaemon.out
Using kafka version: 1.0.0

Start a Kafka consumer and wait for the data to arrive:

[root@bigdata1 ~]# kafka-console-consumer.sh --bootstrap-server bigdata1:9092 --from-beginning --topic maxwell

Create the listening Flume agent (the real-time data script requires the listening port to be open first):

[root@bigdata1 maxwell-1.29.0]# vi /opt/module/flume-1.9.0/conf/diy.conf 
# Define the agent
agent1.channels = channel1
agent1.sources = source1
agent1.sinks = kafka-sink hdfs-sink

# Configure source
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = 0.0.0.0
agent1.sources.source1.port = 25001

# Configure channel
agent1.channels.channel1.type = memory

# Configure sinks
# Kafka sink
agent1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.kafka.topic = order2
agent1.sinks.kafka-sink.kafka.bootstrap.servers = bigdata1:9092

# HDFS sink
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://bigdata1:9000/user/test/
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = channel1

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.kafka-sink.channel = channel1
agent1.sinks.hdfs-sink.channel = channel1
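Note that in this agent both sinks are attached to the same channel. In Flume, each event in a channel is taken by exactly one sink, so two sinks on one channel split the stream between them rather than each receiving a full copy; duplicating the stream requires two channels, as in the previous agent. The difference in a sketch:

```python
from queue import Queue, Empty

# Two sinks draining one shared channel: each event goes to exactly
# one sink, so the stream is split, not duplicated.
channel = Queue()
for e in ["e1", "e2", "e3", "e4"]:
    channel.put(e)

kafka_sink, hdfs_sink = [], []
sinks = [kafka_sink, hdfs_sink]
i = 0
while True:
    try:
        event = channel.get_nowait()
    except Empty:
        break
    sinks[i % 2].append(event)  # sinks alternate taking events
    i += 1

print(kafka_sink, hdfs_sink)  # disjoint halves of the stream
```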

Start Flume:

bin/flume-ng agent -f conf/diy.conf -n agent1 -Dflume.root.logger=INFO,console

Write data into MySQL; Maxwell reads the changes from the binlog and forwards them to the maxwell topic.
posted @ 2024-01-17 02:23  Shachar_xc