Apache Flume入门指南[翻译自官方文档]

声明: 根据官方文档选择性的翻译了下,不对请指正 https://flume.apache.org/FlumeUserGuide.html

术语介绍

组件	说明
Agent	一个flume的jvm实例
Client	消息生产者
Source	从client接收数据,传递给channel
Sink	从channel接收数据发送到目的端
Channel	是source和sink的桥梁,类似队列
Events	可以是日志记录、 avro 对象等

启动方法:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

配置文件示例:

# example.conf: A single-node Flume configuration
 
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
 
# Describe the sink
a1.sinks.k1.type = logger
 
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

生产环境配置 --conf=conf_dir,那么conf_dir目录中应该有下面两个文件

flume-env.sh
log4j.properties

记录日志方便debug, 给java传参数

-Dorg.apache.flume.log.printconfig=true

或者在flume-env.sh里设置JAVA_OPTS=-Dorg.apache.flume.log.printconfig=true 日志级别应设置 debug或trace

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

flume也支持配置放在zookeeper中(实验性质)

第三方插件

flume-env.sh 配置FLUME_CLASSPATH 或者放在plugins.d

多agent模式之间必须使用avro类型

flume还支持多路复用,即一份数据可以复制多份传给多个sink

配置文件

注意: 一个source可以有多个channle, 但是一个sink只能有一个channle,如下所示

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>
 
# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
 
# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

有两种模式可以支持扇出: 复制技术和多路技术

复制: event发送给所有的channel
多路: 选择性的发送给channel

如果要使用多路技术,需要单独定义一个selector

<Agent>.sources.<Source1>.selector.type = replicating 这是默认的replicating

下面是多路的定义方法:

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
 
<Agent>.sources.<Source1>.selector.default = <Channel2>  定义一个默认值 如果什么都不匹配则使用这个default

来个具体例子:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2
 
# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2
 
# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2
 
# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1
 
还有一个optional的例子
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2

当所有的channel都消费完成,selector会尝试往optional也写一份, optional channel写入失败会忽略错误不会重试

如何header配置的channel既是required又是optional的那么这个channel被认为是required, 任何一个channel失败都会导致整个channel集合失败,就像上面的CA,任何一个channel失败都认为 CA失败了

如果一个header没有设置任何channel,那么event会写到default channel,也会往optional写一份. 既是指定了写入optional channel但仍然会往default写. 如果连default也没指定那么event只能写optional channel,写入失败也只是简单的忽略.

Flume source

avro source
下面是配置参数,粗体是必选

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be avro
bind	–	hostname or IP address to listen on
port	–	Port # to bind to
threads	–	Maximum number of worker threads to spawn
selector.type
selector.*
interceptors	–	Space-separated list of interceptors
interceptors.*
compression-type	none	This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource
ssl	false	Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.
keystore	–	This is the path to a Java keystore file. Required for SSL.
keystore-password	–	The password for the Java keystore. Required for SSL.
keystore-type	JKS	The type of the Java keystore. This can be “JKS” or “PKCS12”.
exclude-protocols	SSLv3	Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
ipFilter	false	Set this to true to enable ipFiltering for netty
ipFilterRules	–	Define N netty ipFilter pattern rules with this config.

ipFilterRules例子如下:

allow:name:localhost,deny:ip:

ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:* 允许本机,禁止其他ip allow:name:localhost,deny:ip:

deny:name:localhost,allow:ip: 禁止本机,允许其他

Thrift source
Thrift source通过kerberos可以配置加密模式
需要设置这两个参数: agent-principal agent-keytab
Exec source ( 不推荐使用)
运行unix命令输出到标准输出(如果没有配置logStdErr=true那么错误会被抛弃)

… 太多source了有兴趣的自己看吧

Flume sinks

支持的sink也很多这里只列举一下吧

- HDFS
- HIVE
- Logger 日志级别是INFO, 这个sink大都用来测试和调试
- Avro
- Thrift 也支持kerberos加密
- IRC
- File Roll
- Null 丢弃收到的消息
- HBase
- AsyncHBase 异步模式
- MorphlineSolr
- ElasticSearch
- Kite Dataset
- Kafka

Flume channel

channel就是在agent端存储event的, source发送过来event,sink去消费

支持的channel如下

- memory
- JDBC 用户持久化存储到数据库, 支持内置的Derby
- Kafka
- File
- Spillable memory
  大体解释下这个channel, 有一个内存的队列和一个文件的channle组成,数据先在queue中存,存不下了就往file channle写,这个channel的想法是在正常情况下的高吞吐就是内存的队列, file channel就是应对突然来了很大的数据量或者agent突然挂了,那么数据还能从file恢复
  重点来了: 实验性质,不推荐在生产环境使用
- Pseudo Transaction 用来做单元测试,不推荐生产环境

Flume channle selector

前面说过了,默认的selector是复制模式,而不是多路模式

来一个多路模式的配置文件

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing  这里注意
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

Flume sink processors

可以给sink分组,实现负载均衡或者容错

Property Name	Default	Description
sinks	–	Space-separated list of sinks that are participating in the group
processor.type	default	The component type name, needs to be default, failover or load_balance

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance

有三种类型的processor: default, failover, load_balance

1. 这篇文档中配置的source-channel-sink就是default模式,什么都不用配置
2. failover
  给多个sink配置优先级,保证至少有一个sink是可用的,优先级的数字越大优先级越高
  这些sink被放到一个池子pool里,发送event失败的sink会被暂时移除pool,并被冷却30s(默认),后续的event由次高优先级的sink处理,如果没有设置优先级那么按照配置文件中的次序
3. load balance
  有两个负载均衡的策略: 轮训(round_robin默认), 随机(random)
  如果有一个sink发送失败,processor会选择下一个sink,如果你没有设置backoff,它也不会被从可用sink中移除,如果所有的sink都失败了,那么这个信息就传递给调用者

Event serializer

file_roll和hdfs sink都支持eventSerializer

1. Body text serializer
  就是普通的文本,event的header会被忽略
2. "Flume event" Avro Event serializer
3. Avro event serializer

Flume interceptor

flume可以使用拦截器修改和丢弃event,支持链式拦截器,多个拦截器用空格分隔

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

Timestamp 在header中添加一个timestamp的key 显示处理event的时间
Host 在header中添加host 保存agent的ip地址
Static 用户可以自定义一个key,一次只能添加一个key,如果要添加多个可以使用多个interceptor拦截器
UUID 生成一个128位的值
Morphline
search and replace 基于java的正则表达式 java matcher.replaceAll
regex filter
regex extractor 把正则分组中的匹配放到header中

flume properties

flume.called.from.service 每30s flume会重新载入配置文件.

如果你这是了这个参数 -Dflume.called.from.service

当agent第一次读取配置文件并且这个文件不存在时,恰巧你设置了上面的参数,那么agent不会报错,并会继续等待30s重载配置文件

如果你没设置上面的参数,那么agent立即退出

当agent到了30s重载配置文件, 而这不是第一次载入这个文件,那么agent不会退出,而是认为配置文件没有变化继续使用旧的配置

Json Reporting

flume可以启动一个内置的http server以json的形式报告各种指标

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

posted @ 2017-05-10 16:34 txwsqk 阅读(3027) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

运维从入门到放弃

https://github.com/txwsqk

Apache Flume入门指南[翻译自官方文档]

公告