01. Flume Overview
I. Flume Components
Source---------------------------------------------------------------------

Channel---------------------------------------------------------------------

Sink---------------------------------------------------------------------

II. Flume Architecture
-
Basic architecture

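The first group of settings lists the Sources, Sinks, and Channels; each entry can name one or more components.
The second group sets which Channel each Source stores (puts) data into. A Source can put data into one or more Channels; when the same Source writes to several Channels, this is in effect replication.
The third group sets which Channel each Sink takes data from. A Sink can take data from only one Channel.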
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2> ...
<Agent>.sources.<Source2>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>
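The three groups of settings follow rules that are easy to check mechanically. The following is a minimal Python sketch, not part of Flume: the helper names `parse_wiring` and `check_wiring` and the agent `a1` with components `r1`, `c1`, `c2`, `k1`, `k2` are all hypothetical and chosen only for illustration. It assumes simple `key = value` lines and verifies that every Source names at least one declared Channel and every Sink names exactly one.

```python
def parse_wiring(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found in the source/sink/channel wiring."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    # A Source may put into one or more Channels.
    for source in props.get(f"{agent}.sources", "").split():
        targets = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not targets:
            problems.append(f"source {source} has no channel")
        for ch in targets:
            if ch not in channels:
                problems.append(f"source {source} uses undeclared channel {ch}")
    # A Sink takes from exactly one Channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        targets = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(targets) != 1:
            problems.append(f"sink {sink} must take from exactly one channel")
        elif targets[0] not in channels:
            problems.append(f"sink {sink} uses undeclared channel {targets[0]}")
    return problems


# Hypothetical agent "a1" wired according to the template above.
SAMPLE = """
# list the sources, sinks and channels for the agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# set channel for source
a1.sources.r1.channels = c1 c2
# set channel for sink
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
"""

print(check_wiring(parse_wiring(SAMPLE), "a1"))
```

For the sample wiring the list of problems is empty; pointing a Sink at two Channels would be reported as a problem.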
-
Multiple Agents connected in sequence

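In general, keep the number of Agents chained this way small. The path the data travels gets longer, and unless failover is configured, a failure at any one Agent disrupts collection for every Agent on the Flow.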
-
Multiple Agents consolidating into a single Agent

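This layout is used in many scenarios. For example, suppose you want to collect user-behavior logs from a website that runs as a load-balanced cluster for availability, so that every node produces its own user-behavior logs.
You can configure one Agent per node to collect that node's logs, and have those Agents consolidate the data into a single storage system such as HDFS.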
-
Multiplexing Agent

# Replication
With replication, data from the upstream source is copied and delivered to several channels, and every channel receives the same data. The configuration here sets the selector type to replicating (the default when no selector is configured) and specifies nothing else, so Source1 stores each Event in both Channel1 and Channel2.
The two channels hold identical data, which is then passed on to Sink1 and Sink2.

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating
# Multiplexing
With multiplexing, the selector uses the value of an Event header to decide which channel the data goes to. The configuration here sets the selector type to multiplexing, names the header to inspect, and maps header values to channels: if the header value is Value1 or Value2, the data is routed from Source1 to Channel1;
if the header value is Value2 or Value3, the data is routed from Source1 to Channel2 (so Value2 goes to both channels). Any other value falls through to the default, Channel2.

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
<Agent>.sources.<Source1>.selector.default = <Channel2>
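The routing decision made by the two selector types can be modelled in a few lines. This is a simplified Python sketch of the behaviour described above, not Flume's implementation; the function names `replicating` and `multiplexing` and the representation of an Event's headers as a plain dict are assumptions made for illustration.

```python
def replicating(channels):
    """Selector that sends every Event to all configured channels."""
    def select(headers):
        return list(channels)
    return select


def multiplexing(header, mapping, default):
    """Selector that picks channels from the value of one Event header.

    mapping: header value -> list of channel names
    default: channels used when the header is missing or unmapped
    """
    def select(headers):
        return list(mapping.get(headers.get(header), default))
    return select


# Same mapping as the multiplexing configuration above.
route = multiplexing(
    header="someHeader",
    mapping={
        "Value1": ["Channel1"],
        "Value2": ["Channel1", "Channel2"],
        "Value3": ["Channel2"],
    },
    default=["Channel2"],
)
copy_all = replicating(["Channel1", "Channel2"])

for value in ("Value1", "Value2", "Value3", "other"):
    print(value, "->", route({"someHeader": value}))
print("replicating ->", copy_all({"someHeader": "Value1"}))
```

The replicating selector ignores headers entirely, while the multiplexing selector falls back to the default channel for any value that has no mapping.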
-
Implementing load balancing

The Load balancing Sink Processor provides load balancing. Agent1 is a routing node that spreads the Events buffered in its Channel across several Sinks, and each Sink connects to its own separate downstream Agent.

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
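To make the round_robin selection concrete, here is a rough Python model of how Events could be spread over the sinks k1, k2, and k3. It is a sketch under stated assumptions, not Flume's code: the class `RoundRobinSinks` and the `send` callback are hypothetical, and the timed backoff controlled by `backoff` and `maxTimeOut` is deliberately left out, so a failing sink is simply skipped for the current Event.

```python
class RoundRobinSinks:
    """Rotate the starting sink for each Event; skip sinks that fail."""

    def __init__(self, sinks):
        self.sinks = list(sinks)
        self.next_index = 0

    def deliver(self, event, send):
        """Try each sink once, starting at the rotation point.

        send(sink, event) returns True on success. Returns the name of
        the sink that accepted the Event.
        """
        count = len(self.sinks)
        start = self.next_index
        self.next_index = (start + 1) % count
        for offset in range(count):
            sink = self.sinks[(start + offset) % count]
            if send(sink, event):
                return sink
        raise RuntimeError("all sinks failed")


def send_with_k2_down(sink, event):
    # Hypothetical transport: pretend the Agent behind k2 is unreachable.
    return sink != "k2"


group = RoundRobinSinks(["k1", "k2", "k3"])
delivered = [group.deliver(f"event-{i}", send_with_k2_down) for i in range(4)]
print(delivered)
```

With k2 down, the Event whose turn falls on k2 moves on to k3, so the healthy sinks absorb the load.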
-
Implementing failover

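The Failover Sink Processor provides failover. The setup looks like load balancing, but the internal mechanism is entirely different: the processor maintains a prioritized list of Sinks,
and as long as one Sink is available, Events are delivered to the next component. A Sink that processes an Event successfully is placed in a pool of live Sinks; one that fails is removed from the pool, its failures are counted, and a penalty is applied before it is retried. Sinks with larger priority values are tried first, and maxpenalty caps the penalty in milliseconds.

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 7
a1.sinkgroups.g1.processor.priority.k3 = 6
a1.sinkgroups.g1.processor.maxpenalty = 20000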
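The pool-and-penalty behaviour can be sketched as follows. This is a simplified Python model, not Flume's implementation: the class `FailoverSinks`, the `send` callback, and the `restore` method are hypothetical, and the penalty schedule (1 second, doubling per consecutive failure, capped at maxpenalty) is an assumption chosen for illustration. Real retry timing is omitted; a failed sink stays out of the pool until `restore` is called.

```python
class FailoverSinks:
    """Send each Event to the highest-priority live sink."""

    def __init__(self, priorities, max_penalty_ms):
        self.priorities = dict(priorities)
        self.live = set(priorities)      # pool of sinks believed healthy
        self.failures = {}               # sink -> consecutive failure count
        self.max_penalty_ms = max_penalty_ms

    def penalty_ms(self, sink):
        """Assumed schedule: 1000 ms doubling per failure, capped."""
        count = self.failures.get(sink, 0)
        if count == 0:
            return 0
        return min(self.max_penalty_ms, 1000 * 2 ** (count - 1))

    def restore(self, sink):
        """Put a sink back in the pool once its penalty has elapsed."""
        self.live.add(sink)

    def deliver(self, event, send):
        while self.live:
            sink = max(self.live, key=self.priorities.get)
            if send(sink, event):
                self.failures.pop(sink, None)
                return sink
            # Failure: leave the pool and count it toward the penalty.
            self.live.discard(sink)
            self.failures[sink] = self.failures.get(sink, 0) + 1
        raise RuntimeError("no live sink available")


def send_with_k2_down(sink, event):
    # Hypothetical transport: pretend k2 cannot accept Events.
    return sink != "k2"


# Priorities taken from the configuration above.
group = FailoverSinks({"k1": 5, "k2": 7, "k3": 6}, max_penalty_ms=20000)
print(group.deliver("event-0", send_with_k2_down))
print(group.penalty_ms("k2"))
```

Here k2 has the highest priority and is tried first; when it fails, the Event goes to k3, the next highest, and k2 leaves the pool with a penalty.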
III. Flume Usage Examples
- Avro Source+Memory Channel+Logger Sink
An Avro client sends the external data file abc.log to the Avro Source, with Logger as the sink: the data arrives through an Avro RPC call, is buffered in the agent's channel, and is then printed by the Logger sink.

# list the sources, sinks and channels for the agent
agent1.sources = avro-source1
agent1.channels = ch1
agent1.sinks = log-sink1

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Define an Avro source called avro-source1 on agent1 and tell
# it to bind to 192.168.86.129:9090. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
# must not be 0.0.0.0 or 127.0.0.1
agent1.sources.avro-source1.bind = 192.168.86.129
agent1.sources.avro-source1.port = 9090

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent1
bin/flume-ng avro-client -c ./conf/ -H 192.168.86.129 -p 9090 -F /var/ftp/public/flume/abc.log -Dflume.root.logger=DEBUG,console
- TailDir Source + Memory Channel + Kafka Sink
# set the source names
agent.sources = s1
# set the channel names
agent.channels = c1
# set the sink names
agent.sinks = k1

# For each one of the sources, the type is defined
# source configuration
agent.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
agent.sources.s1.positionFile = /var/ftp/public/flume/taildir_position.json
agent.sources.s1.filegroups = f1
agent.sources.s1.filegroups.f1 = /var/ftp/public/flume/abc.log
agent.sources.s1.batchSize = 100
agent.sources.s1.backoffSleepIncrement = 1000
agent.sources.s1.maxBackoffSleep = 5000

# The channel can be defined as follows.
agent.sources.s1.channels = c1

# Each sink's type must be defined
# k1 configuration
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.brokerList = 127.0.0.1:9092
agent.sinks.k1.topic = test
agent.sinks.k1.serializer.class = kafka.serializer.StringEncoder

# Specify the channel the sink should use
agent.sinks.k1.channel = c1

# Each channel's type is defined.
agent.channels.c1.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 100
bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent
- Avro Source+Memory Channel+HDFS Sink
# Define a source, channel, sink
agent1.sources = avro-source1
agent1.channels = ch1
agent1.sinks = hdfs-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 192.168.86.129:9090. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 192.168.86.129
agent1.sources.avro-source1.port = 9090

# Define an HDFS sink and connect it to the other end of the same channel.
# HDFS sink settings take the hdfs. prefix.
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://h1:8020/data/flume/
agent1.sinks.hdfs-sink1.hdfs.filePrefix = sync_file
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
agent1.sinks.hdfs-sink1.hdfs.rollSize = 1048576
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.batchSize = 1500
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundUnit = minute
agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
agent1.sinks.hdfs-sink1.hdfs.fileType = SequenceFile
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
bin/flume-ng agent -c ./conf/ -f conf/flume-conf-hdfs.properties -Dflume.root.logger=INFO,console -n agent1
bin/flume-ng avro-client -c ./conf/ -H 192.168.86.129 -p 9090 -F /var/ftp/public/flume/abc.log -Dflume.root.logger=DEBUG,console
hdfs dfs -ls /data/flume
- Spooling Directory Source+Memory Channel+HDFS Sink
# Define source, channel, sink
agent1.sources = spool-source1
agent1.channels = ch1
agent1.sinks = hdfs-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define and configure a Spooling Directory source
agent1.sources.spool-source1.channels = ch1
agent1.sources.spool-source1.type = spooldir
agent1.sources.spool-source1.spoolDir = /home/shirdrn/data/
agent1.sources.spool-source1.ignorePattern = event(_\d{4}\-\d{2}\-\d{2}_\d{2}_\d{2})?\.log(\.COMPLETED)?
agent1.sources.spool-source1.batchSize = 50
agent1.sources.spool-source1.inputCharset = UTF-8

# Define and configure an HDFS sink
# HDFS sink settings take the hdfs. prefix.
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://h1:8020/data/flume/
agent1.sinks.hdfs-sink1.hdfs.filePrefix = event_%y-%m-%d_%H_%M_%S
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
agent1.sinks.hdfs-sink1.hdfs.rollSize = 1048576
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.batchSize = 1500
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundUnit = minute
agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
agent1.sinks.hdfs-sink1.hdfs.fileType = SequenceFile
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
bin/flume-ng agent -c ./conf/ -f conf/flume-conf-spool.properties -Dflume.root.logger=INFO,console -n agent1
hdfs dfs -ls /data/flume
- Exec Source+Memory Channel+File Roll Sink
# Define source, channel, sink
agent1.sources = tail-source1
agent1.channels = ch1
agent1.sinks = file-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define and configure an Exec source
agent1.sources.tail-source1.channels = ch1
agent1.sources.tail-source1.type = exec
agent1.sources.tail-source1.command = tail -F /home/shirdrn/data/event.log
agent1.sources.tail-source1.shell = /bin/sh -c
agent1.sources.tail-source1.batchSize = 50

# Define and configure a File roll sink
# and connect it to the other end of the same channel.
agent1.sinks.file-sink1.channel = ch1
agent1.sinks.file-sink1.type = file_roll
agent1.sinks.file-sink1.batchSize = 100
agent1.sinks.file-sink1.serializer = TEXT
agent1.sinks.file-sink1.sink.directory = /home/shirdrn/sink_data
bin/flume-ng agent -c ./conf/ -f conf/flume-conf-file.properties -Dflume.root.logger=INFO,console -n agent1
- TailDir Source + Kafka Channel + Logger Sink
# First, name the components of this agent (the sink is commented out)
agent.sources = s1
agent.channels = c1
#agent.sinks = k1

# Second, define a TailDir source
agent.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
agent.sources.s1.positionFile = /var/ftp/public/flume/taildir_position.json
agent.sources.s1.filegroups = f1
agent.sources.s1.filegroups.f1 = /var/ftp/public/flume/abc.log
agent.sources.s1.batchSize = 100
agent.sources.s1.backoffSleepIncrement = 1000
agent.sources.s1.maxBackoffSleep = 5000

# Third, define a logger sink
#agent.sinks.k1.type = logger

# Fourth, define a Kafka channel
agent.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.c1.brokerList = 192.168.86.129:9092
agent.channels.c1.zookeeperConnect = 192.168.86.129:2181
#agent.channels.c1.kafka.consumer.group.id = flume-consumer
agent.channels.c1.topic = test
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

# Finally, bind the source and sink to the channel
agent.sources.s1.channels = c1
#agent.sinks.k1.channel = c1
bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent
- Kafka Source + Memory Channel + Logger Sink
# First, name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Second, define a Kafka source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = 10.25.20.37:8121
a1.sources.r1.topic = topic1
a1.sources.r1.batchSize = 5
a1.sources.r1.kafka.bootstrap.servers = 10.25.20.36:9092

# Third, define a logger sink
a1.sinks.k1.type = logger

# Fourth, define a memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Finally, bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n a1
http://www.cnblogs.com/makexu/
