01. Flume Overview

I. Flume Components

Source---------------------------------------------------------------------

Channel---------------------------------------------------------------------

Sink---------------------------------------------------------------------

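II. Flume Architecture

Basic architecture

The first group declares the Sources, Sinks, and Channels of the Agent; each list may contain one or more names.
The second group configures which Channel each Source stores (puts) data into. A Source may write to one or more Channels; when the same Source writes to several Channels, the data is in effect replicated.
The third group configures which Channel each Sink takes data from; a Sink can take data from only one Channel.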

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2> ...
<Agent>.sources.<Source2>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

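Chaining multiple Agents in sequence

In general, keep the number of Agents chained this way small: the data path gets longer, and unless failover is configured, a failure at any point affects the collection service of every Agent on the flow.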
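
A minimal sketch of such a chain, assuming two agents with the hypothetical names agentA and agentB: the upstream agent ends in an Avro sink, and the downstream agent starts with an Avro source listening on the same host and port. The agent and component names are illustrative assumptions, and the address 192.168.86.129:9090 is simply reused from the examples later in this post. agentA's own source and agentB's sink are omitted for brevity.

```properties
# --- agentA (upstream): forwards events over Avro RPC ---
agentA.channels = ch1
agentA.channels.ch1.type = memory
agentA.sinks = avro-sink1
agentA.sinks.avro-sink1.type = avro
agentA.sinks.avro-sink1.channel = ch1
# hostname/port of the downstream agent's Avro source (assumed values)
agentA.sinks.avro-sink1.hostname = 192.168.86.129
agentA.sinks.avro-sink1.port = 9090

# --- agentB (downstream): receives events from agentA ---
agentB.channels = ch1
agentB.channels.ch1.type = memory
agentB.sources = avro-source1
agentB.sources.avro-source1.type = avro
agentB.sources.avro-source1.channels = ch1
agentB.sources.avro-source1.bind = 192.168.86.129
agentB.sources.avro-source1.port = 9090
```

Each extra hop repeats this sink-to-source pairing, which is why a long chain multiplies the number of places where the flow can break.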

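Consolidating data from multiple Agents into one Agent

This pattern is common. For example, to collect user-behavior logs from a website that runs as a load-balanced cluster for availability, note that every node produces its own logs.
You can configure one Agent per node to collect that node's logs, and then have all of those Agents converge on a single Agent that writes to a storage system such as HDFS.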
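
A rough sketch of this fan-in layout, assuming hypothetical agent names web1 (one per web node) and collector. Every web-node agent uses an Avro sink pointing at the same collector address, and the collector exposes a single Avro source. The address 192.168.86.129:9090 and the HDFS path are reused from the examples later in this post; the names are placeholders, and the web node's own source is omitted.

```properties
# --- on each web node (web1, web2, ...): identical sink settings ---
web1.channels = ch1
web1.channels.ch1.type = memory
web1.sinks = avro-sink1
web1.sinks.avro-sink1.type = avro
web1.sinks.avro-sink1.channel = ch1
web1.sinks.avro-sink1.hostname = 192.168.86.129
web1.sinks.avro-sink1.port = 9090

# --- on the collector: one Avro source receives from every web node ---
collector.sources = avro-source1
collector.channels = ch1
collector.sinks = hdfs-sink1

collector.channels.ch1.type = memory

collector.sources.avro-source1.type = avro
collector.sources.avro-source1.channels = ch1
collector.sources.avro-source1.bind = 192.168.86.129
collector.sources.avro-source1.port = 9090

collector.sinks.hdfs-sink1.type = hdfs
collector.sinks.hdfs-sink1.channel = ch1
collector.sinks.hdfs-sink1.hdfs.path = hdfs://h1:8020/data/flume/
```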

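Multi-path Agent (fan-out)

#Replication
In replication mode, the data from the front-end source is copied and delivered to several channels, and every channel receives the same data.
The configuration below sets the selector type to replicating and specifies nothing else, so replication is used: Source1 stores each event in both Channel1 and Channel2,
so the two channels hold the same data, which is then delivered to Sink1 and Sink2.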

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

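#Multiplexing
The selector can decide which channel receives an event based on the value of a header.
The configuration below sets the selector type to multiplexing, names the header to inspect, and defines several mappings from header value to channel: if the header value is Value1 or Value2, the event is routed from Source1 to Channel1;
if the header value is Value2 or Value3, it is routed from Source1 to Channel2. Events whose header matches no mapping go to the default channel, Channel2.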

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...

<Agent>.sources.<Source1>.selector.default = <Channel2>
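
As a concrete illustration of the mapping template above, here is a sketch that routes on a hypothetical header named logType. The agent, source, and channel names and the header values (access, error, audit) are assumptions chosen for the example; the header itself would have to be set by the sending client or by an interceptor.

```properties
agent1.sources = avro-source1
agent1.channels = ch1 ch2
agent1.sources.avro-source1.channels = ch1 ch2

# Route on the value of the (hypothetical) logType header
agent1.sources.avro-source1.selector.type = multiplexing
agent1.sources.avro-source1.selector.header = logType
# logType=access -> ch1 only
agent1.sources.avro-source1.selector.mapping.access = ch1
# logType=error  -> both channels
agent1.sources.avro-source1.selector.mapping.error = ch1 ch2
# logType=audit  -> ch2 only
agent1.sources.avro-source1.selector.mapping.audit = ch2
# anything else (or a missing header) -> ch2
agent1.sources.avro-source1.selector.default = ch2
```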

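Implementing load balancing

The Load balancing Sink Processor provides load balancing. Here Agent1 acts as a routing node that distributes the Events buffered in its Channel across several Sinks, and each Sink connects to a separate downstream Agent.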

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000
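
The sink group above refers to k1, k2, and k3 without defining them. A minimal sketch of what those definitions might look like, assuming each one is an Avro sink draining the same channel c1 and forwarding to a different downstream Agent. The channel name, host addresses, and port are hypothetical placeholders.

```properties
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1 k2 k3

# All sinks in the group take from the same channel;
# each one forwards to a different downstream agent (assumed hosts).
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.86.131
a1.sinks.k1.port = 9090

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = 192.168.86.132
a1.sinks.k2.port = 9090

a1.sinks.k3.type = avro
a1.sinks.k3.channel = c1
a1.sinks.k3.hostname = 192.168.86.133
a1.sinks.k3.port = 9090
```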

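Implementing failover

The Failover Sink Processor provides failover. The flow looks like the load-balancing case, but the internal mechanism is entirely different: the Failover Sink Processor maintains a prioritized list of Sinks,
and as long as one Sink is available, Events are delivered to the next component. A Sink that processes Events successfully is added to a pool of live Sinks; a Sink that fails is removed from the pool, its failure count is incremented, and a penalty is applied before it is retried. A larger priority value means higher priority, so with the settings below k2 (7) is tried first, then k3 (6), then k1 (5); maxpenalty caps the penalty period in milliseconds.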

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 7
a1.sinkgroups.g1.processor.priority.k3 = 6
a1.sinkgroups.g1.processor.maxpenalty = 20000

III. Flume Usage Examples

Avro Source+Memory Channel+Logger Sink

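This example uses the Avro client to send an external data source, abc.log, to an Avro Source, with Logger as the sink: the data arrives through an Avro RPC call, is buffered in the agent's channel, and is then printed by the Logger sink.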

# list the sources, sinks and channels for the agent
agent1.sources = avro-source1
agent1.channels = ch1
agent1.sinks = log-sink1

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Define an Avro source called avro-source1 on agent1 and tell
#it to bind to 192.168.86.129:9090. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
# Must not be 0.0.0.0 or 127.0.0.1; use the host's reachable IP
agent1.sources.avro-source1.bind = 192.168.86.129
agent1.sources.avro-source1.port = 9090

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent1
bin/flume-ng avro-client -c ./conf/ -H 192.168.86.129 -p 9090 -F /var/ftp/public/flume/abc.log -Dflume.root.logger=DEBUG,console

TailDir Source + Memory Channel + Kafka Sink

# --> set the source names
agent.sources = s1
# --> set the channel names
agent.channels = c1
# --> set the sink names
agent.sinks = k1

# For each one of the sources, the type is defined
# source configuration
agent.sources.s1.type = org.apache.flume.source.taildir.TaildirSource

agent.sources.s1.positionFile = /var/ftp/public/flume/taildir_position.json
agent.sources.s1.filegroups = f1
agent.sources.s1.filegroups.f1 = /var/ftp/public/flume/abc.log
agent.sources.s1.batchSize = 100
agent.sources.s1.backoffSleepIncrement = 1000
agent.sources.s1.maxBackoffSleep = 5000

# The channel can be defined as follows.
agent.sources.s1.channels = c1

# Each sink's type must be defined
# k1 configuration
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.brokerList = 127.0.0.1:9092
agent.sinks.k1.topic = test
agent.sinks.k1.serializer.class = kafka.serializer.StringEncoder

#Specify the channel the sink should use
agent.sinks.k1.channel = c1

# Each channel's type is defined.
agent.channels.c1.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity=100

bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent

Avro Source+Memory Channel+HDFS Sink

# Define a source, channel, sink
agent1.sources = avro-source1
agent1.channels = ch1
agent1.sinks = hdfs-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 192.168.86.129:9090. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind =192.168.86.129
agent1.sources.avro-source1.port = 9090

# Define an HDFS sink that writes the events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://h1:8020/data/flume/
agent1.sinks.hdfs-sink1.hdfs.filePrefix = sync_file
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
agent1.sinks.hdfs-sink1.hdfs.rollSize = 1048576
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.batchSize = 1500
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundUnit = minute
agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
agent1.sinks.hdfs-sink1.hdfs.fileType = SequenceFile
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text

bin/flume-ng agent -c ./conf/ -f conf/flume-conf-hdfs.properties -Dflume.root.logger=INFO,console -n agent1
bin/flume-ng avro-client -c ./conf/ -H 192.168.86.129 -p 9090 -F /var/ftp/public/flume/abc.log -Dflume.root.logger=DEBUG,console
hdfs dfs -ls /data/flume

Spooling Directory Source+Memory Channel+HDFS Sink

# Define source, channel, sink
agent1.sources = spool-source1
agent1.channels = ch1
agent1.sinks = hdfs-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define and configure a Spooling Directory source
agent1.sources.spool-source1.channels = ch1
agent1.sources.spool-source1.type = spooldir
agent1.sources.spool-source1.spoolDir = /home/shirdrn/data/
# Backslashes are doubled because the properties parser consumes a single backslash
agent1.sources.spool-source1.ignorePattern = event(_\\d{4}-\\d{2}-\\d{2}_\\d{2}_\\d{2})?\\.log(\\.COMPLETED)?
agent1.sources.spool-source1.batchSize = 50
agent1.sources.spool-source1.inputCharset = UTF-8

# Define and configure a hdfs sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://h1:8020/data/flume/
agent1.sinks.hdfs-sink1.hdfs.filePrefix = event_%y-%m-%d_%H_%M_%S
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
agent1.sinks.hdfs-sink1.hdfs.rollSize = 1048576
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.batchSize = 1500
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundUnit = minute
agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
agent1.sinks.hdfs-sink1.hdfs.fileType = SequenceFile
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0

bin/flume-ng agent -c ./conf/ -f conf/flume-conf-spool.properties -Dflume.root.logger=INFO,console -n agent1

hdfs dfs -ls /data/flume

Exec Source+Memory Channel+File Roll Sink

# Define source, channel, sink
agent1.sources = tail-source1
agent1.channels = ch1
agent1.sinks = file-sink1

# Configure channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.transactionCapacity = 500000

# Define and configure an Exec source
agent1.sources.tail-source1.channels = ch1
agent1.sources.tail-source1.type = exec
agent1.sources.tail-source1.command = tail -F /home/shirdrn/data/event.log
agent1.sources.tail-source1.shell = /bin/sh -c
agent1.sources.tail-source1.batchSize = 50

# Define and configure a File roll sink
# and connect it to the other end of the same channel.
agent1.sinks.file-sink1.channel = ch1
agent1.sinks.file-sink1.type = file_roll
agent1.sinks.file-sink1.batchSize = 100
agent1.sinks.file-sink1.serializer = TEXT
agent1.sinks.file-sink1.sink.directory = /home/shirdrn/sink_data

bin/flume-ng agent -c ./conf/ -f conf/flume-conf-file.properties -Dflume.root.logger=INFO,console -n agent1

TailDir Source + Kafka Channel + Logger Sink

# Firstly, declare the components of this agent
agent.sources = s1
agent.channels = c1
#agent.sinks = k1

# Secondly, Define a TailDir source
agent.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
agent.sources.s1.positionFile = /var/ftp/public/flume/taildir_position.json
agent.sources.s1.filegroups = f1
agent.sources.s1.filegroups.f1 = /var/ftp/public/flume/abc.log
agent.sources.s1.batchSize = 100
agent.sources.s1.backoffSleepIncrement = 1000
agent.sources.s1.maxBackoffSleep = 5000

# Thirdly, Define a logger sink
#agent.sinks.k1.type = logger


# Fourthly, Define a Kafka channel
agent.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.c1.brokerList = 192.168.86.129:9092
agent.channels.c1.zookeeperConnect = 192.168.86.129:2181
#agent.channels.c1.kafka.consumer.group.id = flume-consumer
agent.channels.c1.topic = test
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100


# Finally, Bind the source and sink to the channel
agent.sources.s1.channels = c1
#agent.sinks.k1.channel = c1

bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent

Kafka Source + Memory Channel + Logger Sink

# Firstly, declare the components of this agent
a1.sources = r1  
a1.sinks = k1  
a1.channels = c1  
 
# Secondly, Define a Kafka source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource  
a1.sources.r1.zookeeperConnect = 10.25.20.37:8121  
a1.sources.r1.topic = topic1  
a1.sources.r1.batchSize = 5  
a1.sources.r1.kafka.bootstrap.servers=10.25.20.36:9092 

# Thirdly, Define a logger sink  
a1.sinks.k1.type = logger  
 
# Fourthly, Define a memory channel
a1.channels.c1.type = memory  
a1.channels.c1.capacity = 1000  
a1.channels.c1.transactionCapacity = 100  
  
# Finally, Bind the source and sink to the channel
a1.sources.r1.channels = c1  
a1.sinks.k1.channel = c1

bin/flume-ng agent -c ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n a1

Posted 2017-08-04 21:15 by 桃源仙居
