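Flume Study Notes

I. Flume Overview
- 1. Definition
- 2. Basic architecture
II. Using Flume
- 1. Official example
- 2. Other usage patterns
III. Advanced Flume
- 1. Transactions
- 2. Agent internals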

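I. Flume Overview

1. Definition

Flume, originally developed at Cloudera and now an Apache project, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It is built on a streaming architecture and is flexible and simple.

Flume's most common job is to read data from a server's local disk in real time and write it to HDFS.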

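2. Basic architecture

Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.

An Agent consists of three main components: Source, Channel, and Sink.

Source

The Source is the component that receives data into the Flume Agent. Source components can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.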
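
Channel

The Channel is a buffer that sits between the Source and the Sink, which lets the Source and the Sink run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

The two built-in Channels covered here are Memory Channel and File Channel.

Memory Channel is an in-memory queue. It is suitable when data loss is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose whatever is still in memory.

File Channel writes all events to disk, so no data is lost when the process shuts down or the machine goes down.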
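
To make the buffering idea concrete, here is a minimal toy model of a bounded in-memory channel. It is only a sketch: `ToyMemoryChannel` and its methods are hypothetical names, not Flume's implementation, and the two limits simply mirror the `capacity` and `transactionCapacity` settings that appear in the example configs in these notes.

```python
from collections import deque


class ToyMemoryChannel:
    """Hypothetical bounded queue standing in for a memory channel."""

    def __init__(self, capacity=1000, transaction_capacity=100):
        self.capacity = capacity                          # max events held at once
        self.transaction_capacity = transaction_capacity  # max events per batch
        self._queue = deque()

    def put_batch(self, events):
        """Source side: add a batch, all or nothing."""
        if len(events) > self.transaction_capacity:
            raise ValueError("batch larger than transaction capacity")
        if len(self._queue) + len(events) > self.capacity:
            raise OverflowError("channel is full")
        self._queue.extend(events)

    def take_batch(self, max_events):
        """Sink side: remove up to max_events events in arrival order."""
        n = min(max_events, self.transaction_capacity, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]

    def __len__(self):
        return len(self._queue)
```

Because everything lives in a Python `deque`, the contents disappear when the process exits, which is the same trade-off the text describes for Memory Channel.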
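
Sink

The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or sends them on to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.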
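
Event

The Event is the basic unit of data transfer in Flume: data travels from source to destination as Events. An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.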
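
The Header/Body split can be sketched as a small data structure. This is an illustrative model only: `ToyEvent` and `event_from_line` are hypothetical names, not Flume's Java API, and UTF-8 encoding is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyEvent:
    """Hypothetical stand-in for an event: K-V headers plus a byte-array body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def event_from_line(line: str, **headers: str) -> ToyEvent:
    """Wrap one line of log text as an event (UTF-8 is an assumption)."""
    return ToyEvent(body=line.encode("utf-8"), headers=dict(headers))
```

For example, `event_from_line("hello", host="hadoop102")` carries the payload as bytes in `body` and the attribute `host` in `headers`.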

II. Using Flume

1. Official example

Create the configuration file flume-netcat-logger.conf:

vim flume-netcat-logger.conf

Add the following content:

# Name the components on this agent
# r1 is the name of a1's source
a1.sources = r1
# k1 is the name of a1's sink
a1.sinks = k1
# c1 is the name of a1's channel
a1.channels = c1

# Describe/configure the source
# the source type is netcat (listens on a TCP port)
a1.sources.r1.type = netcat
# host to bind to
a1.sources.r1.bind = localhost
# port to listen on
a1.sources.r1.port = 44444

# Describe the sink
# the sink type is logger, which writes events to the log (the console here)
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# the channel type is memory
a1.channels.c1.type = memory
# the channel holds at most 1000 events
a1.channels.c1.capacity = 1000
# a single transaction carries at most 100 events
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# connect r1 to c1
a1.sources.r1.channels = c1
# connect k1 to c1
a1.sinks.k1.channel = c1

Start Flume so that it listens on the port

First form (long options):

 bin/flume-ng agent --conf conf/ --name a1 --conf-file flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Second form (short options):

 bin/flume-ng agent -c conf/ -n a1 -f flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Use the netcat tool to send content to port 44444 on the local machine:
```
nc localhost 44444
hello
OK
```

2. Other usage patterns

Monitoring a single appended file in real time

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9820/flume/%Y%m%d/%H

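# Prefix for files written to HDFS
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the timestamp down (controls how often a new directory is used)
a2.sinks.k2.hdfs.round = true
# Round down to a multiple of this many time units
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# File type; DataStream is uncompressed (CompressedStream enables compression)
a2.sinks.k2.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a2.sinks.k2.hdfs.rollInterval = 60
# Roll to a new file at this size in bytes (just under 128 MiB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0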

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
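
The interaction between the time escapes in `hdfs.path`, the `round` settings, and `rollSize` is easier to see with a small calculation. The sketch below is a toy re-creation in Python, not Flume code: `round_down_hours` and `expand_path` are hypothetical helpers, and it relies on `%Y`, `%m`, `%d`, and `%H` meaning the same thing in `strftime` as in the path above.

```python
from datetime import datetime

PATH = "hdfs://hadoop102:9820/flume/%Y%m%d/%H"  # path template from the config
ROLL_SIZE = 134217700                            # rollSize from the config
MIB_128 = 128 * 1024 * 1024                      # 128 MiB in bytes


def round_down_hours(ts: datetime, round_value: int = 1) -> datetime:
    """Round ts down to a multiple of round_value hours (roundUnit = hour)."""
    return ts.replace(hour=ts.hour - ts.hour % round_value,
                      minute=0, second=0, microsecond=0)


def expand_path(template: str, ts: datetime) -> str:
    """Fill the time escapes in the path template from a timestamp."""
    return ts.strftime(template)
```

With an arbitrary sample timestamp of 2021-10-04 19:43:12, `expand_path(PATH, round_down_hours(ts))` gives `hdfs://hadoop102:9820/flume/20211004/19`, so every event from that hour lands in the same directory. `MIB_128 - ROLL_SIZE` is 28, which shows that the configured roll size sits 28 bytes below 128 MiB.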

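Monitoring new files in a directory in real time

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
# Suffix appended to files that have been fully ingested
a3.sources.r3.fileSuffix = .COMPLETED
# Whether to add a header storing the absolute file path
a3.sources.r3.fileHeader = true
# Ignore (do not upload) all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)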

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9820/flume/upload/%Y%m%d/%H
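# Prefix for files written to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls how often a new directory is used)
a3.sinks.k3.hdfs.round = true
# Round down to a multiple of this many time units
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream is uncompressed (CompressedStream enables compression)
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll to a new file at this size in bytes (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0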

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

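When using the Spooling Directory Source, do not create a file in the monitored directory and then keep modifying it. Files that have been fully uploaded are renamed with the .COMPLETED suffix, and the monitored directory is scanned for changes every 500 milliseconds.

Monitoring multiple appended files in a directory in real time

Exec Source is suited to monitoring a single file that is appended to in real time, but it cannot resume from where it left off after a restart. Spooldir Source is suited to syncing new files, but not to watching and syncing log files that are still being appended to. Taildir Source is suited to watching multiple files that are appended to in real time, and it can resume from where it left off.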

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files2/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
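a3.sinks.k3.hdfs.path = hdfs://hadoop102:9820/flume/upload2/%Y%m%d/%H
# Prefix for files written to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls how often a new directory is used)
a3.sinks.k3.hdfs.round = true
# Round down to a multiple of this many time units
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream is uncompressed (CompressedStream enables compression)
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll to a new file at this size in bytes (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0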

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Taildir Source maintains a position file in JSON format and periodically records in it the latest position read in each file, which is what makes resuming possible.

The position file looks like this:
```
[
    {
        "inode":2496272,
        "pos":12,
        "file":"/opt/module/flume/files/file1.txt"
    },
    {
        "inode":2496275,
        "pos":12,
        "file":"/opt/module/flume/files/file2.txt"
    }
]
```
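
As a rough illustration of how recorded positions enable resuming, the sketch below loads position entries and continues reading a file from its saved byte offset. It is not Taildir Source's code: `load_positions` and `read_from` are hypothetical helpers, and the sketch assumes the position file content is a JSON array of entries with the `inode`, `pos`, and `file` fields shown above.

```python
import json
from typing import Dict, Tuple


def load_positions(text: str) -> Dict[int, Tuple[int, str]]:
    """Parse position-file text into {inode: (pos, file)}.

    Assumes the text is a JSON array of {"inode", "pos", "file"} objects.
    """
    return {e["inode"]: (e["pos"], e["file"]) for e in json.loads(text)}


def read_from(path: str, pos: int) -> bytes:
    """Return everything in the file after byte offset pos."""
    with open(path, "rb") as f:
        f.seek(pos)  # skip the bytes that were already consumed
        return f.read()
```

After a restart, a reader built this way would look up the saved `pos` for each file and call `read_from(file, pos)` instead of starting again from byte 0.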

Note: on Linux, the area that stores a file's metadata is called an inode. Every inode has a number, and Unix/Linux systems identify files internally by inode number rather than by file name.

III. Advanced Flume

1. Transactions

2. Agent internals

ChannelSelector

The ChannelSelector decides which Channel or Channels an Event is sent to. There are two types: Replicating and Multiplexing.

The Replicating selector sends the same Event to every Channel. The Multiplexing selector sends different Events to different Channels according to configured rules.
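
The difference between the two selector types can be sketched in a few lines. This is a toy model, not Flume's classes: the function names are hypothetical, and routing on a single header value with a default channel list is an assumption about what the "rules" look like.

```python
from typing import Dict, List


def replicating_select(channels: List[str]) -> List[str]:
    """Replicating: every event goes to all channels."""
    return list(channels)


def multiplexing_select(headers: Dict[str, str],
                        header_name: str,
                        mapping: Dict[str, List[str]],
                        default: List[str]) -> List[str]:
    """Multiplexing: pick channels from the value of one event header.

    Events whose header value has no mapping fall back to the default list.
    """
    return mapping.get(headers.get(header_name, ""), default)
```

For example, with `mapping = {"error": ["c1"], "info": ["c2"]}` and `header_name = "level"`, an event whose headers are `{"level": "error"}` is routed to `c1` only, while the replicating selector would deliver it to every channel.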

SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

DefaultSinkProcessor works with a single Sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover (error recovery).
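
The two Sink Group strategies can be contrasted with a toy model. The sketch below is not Flume's implementation: the function names are hypothetical, round-robin is just one possible balancing policy, and treating a larger priority number as "preferred" is an assumption made for the example.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Optional, Set


def failover_pick(priorities: Dict[str, int], healthy: Set[str]) -> Optional[str]:
    """Failover: use the highest-priority sink that is still healthy."""
    candidates = [name for name in priorities if name in healthy]
    if not candidates:
        return None  # no sink available
    return max(candidates, key=lambda name: priorities[name])


def round_robin(sinks: List[str]) -> Iterator[str]:
    """Load balancing: hand batches to the sinks in turn."""
    return cycle(sinks)
```

With priorities `{"k1": 5, "k2": 10}`, `failover_pick` returns `k2` while it is healthy and falls back to `k1` once `k2` drops out of the healthy set, whereas `round_robin(["k1", "k2"])` alternates between the two sinks regardless of priority.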

posted @ 2021-10-04 19:43 Yuutmoo