Kafka Quick Start

Kafka is a distributed streaming platform. What exactly does that mean?

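We think a streaming platform has three key capabilities:

Publish and subscribe to streams of messages. In this respect it is similar to a message queue or an enterprise messaging system.
Store streams of messages in a fault-tolerant (failover-capable) way.
Process streams of messages as they occur.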

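What is Kafka good for? It is used for two broad classes of applications:

Building real-time streaming data pipelines that reliably move data between systems or applications.
Building real-time streaming applications that transform or react to streams of data.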

To understand how Kafka does these things, let's dive into its capabilities from the bottom up.

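First, a few concepts:

Kafka runs as a cluster on one or more servers.
The Kafka cluster stores streams of messages in categories called topics.
Each message (also called a record; I usually just say "message") consists of a key, a value, and a timestamp.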

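Kafka has four core APIs:

Applications use the Producer API to publish messages to one or more topics.
Applications use the Consumer API to subscribe to one or more topics and process the messages produced to them.
Applications use the Streams API to act as a stream processor. A stream processor consumes an input stream from one or more topics and produces an output stream to one or more output topics, which effectively transforms input streams into output streams.
The Connector API lets you build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.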

Introduction to Kafka

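Clients and servers communicate over a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and stays backward compatible with older versions. Kafka ships a Java client, and clients are also available for many other programming languages.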

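First, let's look at the basic terminology Kafka uses:

Topic

Kafka sorts messages into categories, and each category is called a topic.

Producer

An object that publishes messages to a topic is called a producer.

Consumer

An object that subscribes to topics and processes the published messages is called a consumer.

Broker

Published messages are stored on a set of servers called a Kafka cluster, and each server in the cluster is a broker. Consumers subscribe to one or more topics and pull data from the brokers to consume the published messages.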

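Topics and Logs

Let's take a closer look at topics in Kafka.

A topic is the category name that messages are published to. A topic can have zero, one, or many consumers subscribed to its messages.

For each topic, the Kafka cluster maintains a partitioned log, as shown in the figure below: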

(Figure: the partitioned log of a topic)

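Each partition is an ordered, immutable sequence of messages that is continually appended to. Every message in a partition is assigned a sequential id called the offset, which uniquely identifies the message within that partition.

The Kafka cluster retains all published messages, whether or not they have been consumed, until they expire under a configurable retention period. In fact, the only metadata kept per consumer is its offset, that is, its position in the log, and the consumer controls this offset. Normally a consumer advances its offset linearly as it reads messages. Because the position is under the consumer's control, it can also reset the offset to an older position and re-read messages. This design makes consumers very cheap: one consumer's actions do not affect how other consumers process the same log.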
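To make the offset model concrete, here is a minimal in-memory sketch in plain Java. It is not Kafka's actual storage code, and the class and method names are hypothetical. A partition is modeled as an append-only list, and each record's offset is its index. Each reader keeps only its own position, so rewinding one reader does not affect another.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one partition: an append-only, ordered sequence of records.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset assigned to it (its index in the log).
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Reads the record at the given offset; records are never modified in place.
    String read(long offset) {
        return records.get((int) offset);
    }

    // The offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }
}

// Hypothetical reader: its only state is the offset of the next record to read.
class OffsetTrackingReader {
    private final PartitionLog log;
    private long position;

    OffsetTrackingReader(PartitionLog log, long startOffset) {
        this.log = log;
        this.position = startOffset;
    }

    // Returns the next record, or null if the reader has caught up with the end of the log.
    String next() {
        if (position >= log.endOffset()) {
            return null;
        }
        return log.read(position++);
    }

    // Moving the position lets this reader re-read old records without affecting other readers.
    void seek(long offset) {
        position = offset;
    }

    long position() {
        return position;
    }
}
```

Two readers created over the same PartitionLog advance independently. Calling seek(0) on one of them replays the log for that reader only.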


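Now a word on partitions. Kafka uses partitions for several purposes. First, they let a log scale beyond what fits on a single server. Each individual partition must fit on the server that hosts it, but a topic can have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism, which we come back to shortly.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, and each server handles the partitions assigned to it. For fault tolerance, each partition can be replicated to a configurable number of other servers. Each partition has one leader and zero or more followers. The leader handles all read and write requests for the partition, and the followers passively replicate the leader. If the leader fails, one of the followers is elected as the new leader. A server can be the leader for some partitions and a follower for others. This balances load across the cluster, so requests do not all land on one or a few servers.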

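Geo-Replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages can be replicated across multiple data centers or cloud regions. You can use this in active/passive scenarios for backup and recovery. You can also use it in active/active scenarios to place data closer to your users or to meet data-locality requirements.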

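Producers

Producers publish messages to the topics of their choice. The producer is also responsible for choosing which partition within the topic each message goes to. The simplest approach is round-robin over the partitions, which just balances load. The producer can also choose according to some partitioning function, for example one based on the message key. The developer decides which partitioning algorithm to use.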
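Below is a sketch of the two partition-selection strategies just described, assuming a fixed partition count. Note the assumption: Kafka's real default partitioner hashes the serialized key with murmur2. This hypothetical class uses String.hashCode() only to show that the same key always maps to the same partition, which is what gives per-key ordering.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical partition selectors for illustration; not Kafka's DefaultPartitioner.
class SimplePartitioner {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger(0);

    SimplePartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Round-robin: spreads records evenly, ignoring their content.
    int roundRobin() {
        // floorMod keeps the result non-negative even after the counter overflows.
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }

    // Key-based: the same key always lands in the same partition.
    int forKey(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Round-robin gives the best load spread, but messages with the same key can end up in different partitions. Key-based selection trades some balance for the guarantee that all messages with one key stay in one ordered partition.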

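Consumers

Messaging traditionally has two models: queuing and publish-subscribe. In the queue model, a group of consumers reads from a server and each message is handled by only one of them. In the publish-subscribe model, each message is broadcast to all consumers, and every consumer that receives it can process it. Kafka offers a single consumer abstraction that covers both models: the consumer group. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer within each subscribing group. If all consumers are in the same group, this becomes the queue model. If every consumer is in a different group, this becomes pure publish-subscribe. More generally, you create a few consumer groups, each acting as one logical subscriber. Each group contains some number of consumers, and having several consumers in a group provides scalability and fault tolerance, as shown in the figure below: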

A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
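The setup in this figure can be modeled in a few lines. This is an illustration only: Kafka's real assignors (range, round-robin, sticky) are pluggable and more involved, and GroupAssignment is a hypothetical name. Every group sees all partitions, and within one group each partition goes to exactly one member.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical round-robin assignment of partitions to the members of ONE consumer group.
class GroupAssignment {
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String m : members) {
            result.put(m, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return result; // nobody to assign to
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition is owned by exactly one member of the group.
            String owner = members.get(p % members.size());
            result.get(owner).add(p);
        }
        // Members beyond the partition count end up with an empty list, i.e. they sit idle.
        return result;
    }
}
```

With group A = [C1, C2] and four partitions, each consumer gets two partitions. With group B = [C3, C4, C5, C6], each consumer gets one. A fifth member in group B would receive nothing. Because each group is assigned independently, both groups read every message.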

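Like a traditional messaging system, Kafka preserves message order, and its guarantees are stronger. In more detail: a traditional queue keeps messages in order on the server, and the server hands them out in that order. However, messages are then delivered to consumers asynchronously, so the order in which different consumers receive them is not guaranteed. In other words, parallel consumption loses message ordering. Anyone who has used a traditional messaging system knows how painful ordered processing can be. Letting only one consumer process the queue restores order, but it defeats the purpose of parallel processing. Kafka does better here, although it does not solve the problem completely. It uses a divide-and-conquer strategy: partitioning. Within a consumer group, each partition of a topic is consumed by exactly one consumer, so that partition's messages are processed in order. Kafka only guarantees ordering within a partition, not across the different partitions of a topic. If you need all messages of a topic processed in order, use a topic with only one partition. That also means only one consumer per consumer group can do the work.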

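Kafka's Guarantees

Messages that a producer sends to a particular topic partition are appended in the order they are sent. That is, if the same producer sends M1 and then M2, M1 has a lower offset than M2 and appears earlier in the log.
A consumer sees messages in the order they are stored in the log.
For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any committed messages.

For more details on these guarantees, see the design section of the documentation.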

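Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Traditional messaging has two models: queuing and publish-subscribe. In the queue model, a pool of consumers reads from a server, and each message is read by only one of them. In the publish-subscribe model, messages are broadcast to all consumers. Each model has strengths and weaknesses. A queue lets you divide the processing of data over multiple consumers, so processing scales. However, a queue does not support multiple subscribers: once one process reads a message, it is gone. Publish-subscribe lets you broadcast data to multiple consumers, but because every subscriber receives every message, it has no way to scale processing.

Kafka's consumer group concept generalizes both models. Like a queue, a consumer group lets its members (consumers with the same group name) divide up the processing. Like publish-subscribe, Kafka lets you broadcast messages to multiple consumer groups (with different names).

Every Kafka topic supports both models at once.

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional messaging system stores data in order. If multiple consumers consume from the queue, the server hands out messages in the order they are stored. However, delivery to consumers is asynchronous, so messages may arrive at different consumers out of order. This means ordering is lost in the presence of parallel consumption. Messaging systems often work around this by allowing only one consumer, but then processing is no longer parallel.

Kafka does better. By parallelizing a topic over its partitions, Kafka provides both ordering guarantees and load balancing. Each partition is consumed by exactly one consumer in a consumer group, so that consumer is the partition's only reader and consumes its data in order. Because a topic has many partitions, load is still balanced over many consumers. Note, however, that a consumer group cannot usefully have more consumers than partitions: any extra consumers sit idle and never receive messages.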

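Kafka as a Storage System

Any message queue that decouples publishing from consuming effectively acts as a storage system for in-flight messages (published messages are stored first). What sets Kafka apart is that it is a very high-performance storage system.

Data written to Kafka is written to disk and replicated within the cluster for fault tolerance. Producers can wait for an acknowledgement, so a write is not considered complete until the message has been fully written.

Kafka's disk structures scale well: Kafka performs the same whether a server holds 50 KB or 50 TB of data.

Because clients control their own read position, you can think of Kafka as a special-purpose distributed file system for high-performance, low-latency commit log storage, replication, and propagation.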

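Stream Processing in Kafka

Reading, writing, and storing data is not enough; Kafka's goal is real-time stream processing.

In Kafka, a stream processor continuously takes data from input topics, processes it, and writes the results to output topics. For example, a retail application might take input streams of sales and shipments, compute quantities or adjust prices, and output the results.

Simple processing can be done directly with the producer and consumer APIs. For more complex transformations, Kafka provides the more powerful Streams API, which lets you build applications that compute aggregations over streams or join streams together.

The Streams API helps with the hard problems these applications face: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and so on.

The Streams API builds on Kafka's core primitives. It uses the producer and consumer APIs for input and output, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among stream processor instances.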

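Putting the Pieces Together

Combining messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS stores static files for batch processing. Systems like this efficiently store and process historical data from the past.

A traditional enterprise messaging system processes messages that arrive after you subscribe: it handles future data as it arrives.

Kafka combines both capabilities. This combination is critical to Kafka's use both as a platform for streaming applications and for streaming data pipelines.

Combining storage with low-latency subscriptions unifies batch processing and message-driven applications: a streaming application can treat past and future data the same way. A single application can process the historical, stored data. When it reaches the last message, it does not end; it keeps waiting for future data to arrive.

Likewise, for streaming data pipelines, subscribing to real-time events makes it possible to use Kafka for very low-latency pipelines. The ability to store data reliably makes it usable for critical data whose delivery must be guaranteed. It also allows integration with offline systems that load data only periodically or are down for long maintenance windows. Stream processing can transform the data as it arrives.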

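Kafka Use Cases

Messaging

Kafka works well as a replacement for a traditional messaging system. Message brokers are used for many purposes, such as decoupling data producers from processing or buffering unprocessed messages. Compared with most messaging systems, Kafka has better throughput and built-in partitioning, replication, and failover, which make it a good fit for large-scale message processing.

In our experience, messaging uses often have relatively low throughput but need low end-to-end latency and strong durability guarantees.

In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Website Activity Tracking

Kafka's original use case was user activity tracking. Site activity (page views, searches, or other user actions) is published to central topics, with one topic per activity type. These feeds can be processed in real time, used for real-time monitoring, or loaded into Hadoop or an offline data warehouse for offline processing.

Activity tracking is often very high volume, because each user page view generates many activity messages.

Metrics

Kafka is often used for operational monitoring data: statistics from distributed applications are aggregated into centralized feeds.

Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files from servers and puts them in a central place (a file server or HDFS, perhaps) for processing. Kafka abstracts away the details of files and presents log or event data more cleanly as a stream of messages. This allows lower-latency processing and makes it easier to support multiple data sources and distributed data consumption.

Stream Processing

Processing in Kafka is often organized as a pipeline of multiple stages. Raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics. For example, a news-article recommendation pipeline might read article content from an "articles" topic, process the content further into a new, cleaned-up form, and finally recommend it to users. Such pipelines build real-time data flows from individual topics. Starting with 0.10.0.0, the lightweight but powerful Kafka Streams library is available for this kind of processing.

Besides Kafka Streams, alternatives include Apache Storm and Apache Samza.

Event Sourcing

Event sourcing is an application design style in which state changes are recorded as a time-ordered sequence of records. Kafka's support for very large amounts of stored log data makes it a good fit for this style.

Commit Log

Kafka can serve as an external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-sync mechanism for failed nodes to restore their data. Kafka's log compaction feature supports this usage well. In this role Kafka is similar to the Apache BookKeeper project.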

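Installing and Starting Kafka

We have covered plenty of background, so let's get hands-on. This walkthrough assumes you have no existing Kafka or ZooKeeper environment.

Step 1: Download the code

Download the 2.3.0 release and extract it.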

> tar -xzf kafka_2.12-2.3.0.tgz
> cd kafka_2.12-2.3.0

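Step 2: Start the server

Kafka uses ZooKeeper, so you need to start ZooKeeper first. If you don't have a ZooKeeper installation, you can use the pre-packaged and pre-configured one that ships with Kafka.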

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties &
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...

Step 3: Create a topic

Create a topic named "test" with a single partition and only one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Once it is created, you can list the topic by running:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test

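Alternatively, instead of creating topics manually, you can configure your brokers to create a topic automatically when a message is published to a topic that does not exist.

Step 4: Send some messages

Kafka comes with a command-line client that reads input from a file or from standard input and sends it to the Kafka cluster. Each line is sent as a separate message.
Run the producer and type a few messages into the console to send them to the server.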

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

Step 5: Consume messages

Kafka also has a command-line consumer that prints the stored messages to standard output.

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message

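If you run the producer and consumer commands above in two different terminals, the messages you type into the producer terminal will appear in the consumer terminal.

Step 6: Set up a multi-broker cluster

So far we have been running a single broker, which isn't much fun. For Kafka, a single broker is just a cluster of size one, so let's add a few more brokers.

First, create a config file for each new broker: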

> cp config/server.properties config/server-1.properties 
> cp config/server.properties config/server-2.properties

Now edit these new files and set the following properties:

config/server-1.properties: 
    broker.id=1 
    listeners=PLAINTEXT://:9093 
    log.dir=/tmp/kafka-logs-1

config/server-2.properties: 
    broker.id=2 
    listeners=PLAINTEXT://:9094 
    log.dir=/tmp/kafka-logs-2

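broker.id is the unique and permanent name of each node in the cluster. We override the port and log directory only because all brokers run on the same machine, and we must keep them from registering on the same port or overwriting each other's data.

ZooKeeper and our first Kafka node are already running, so we only need to start the two new nodes: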

> bin/kafka-server-start.sh config/server-1.properties &
... 
> bin/kafka-server-start.sh config/server-2.properties &
...

Now create a new topic with a replication factor of 3:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Now that we have a cluster, how do we know which broker is doing what? Run the "describe topics" command:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: my-replicated-topic    Partition: 0    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0

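Here is what the output means. The first line is a summary of all partitions, and each following line describes one partition. This topic has only one partition, so there is only one such line.

"leader": the node responsible for all reads and writes for this partition. Each node is the leader for a randomly selected portion of the partitions.
"replicas": the list of nodes that replicate this partition, shown regardless of whether a node is the leader or even currently alive.
"isr": the set of "in-sync" replicas, that is, the nodes that are alive and caught up with the leader.

Now run the same command for the first topic we created: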

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test    PartitionCount:1    ReplicationFactor:1    Configs:
Topic: test    Partition: 0    Leader: 0    Replicas: 0    Isr: 0

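No surprise there: the original topic has no additional replicas and lives on server 0, which was the only server in the cluster when we created it.

Let's publish a few messages to the new topic: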

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
 ...
my test message 1
my test message 2
^C

Now consume these messages:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
 ...
my test message 1
my test message 2
^C

Now let's test fault tolerance by killing the leader. Broker 1 is the current leader, so kill broker 1:

> ps aux | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java... 
> kill -9 7564

On Windows, use:

> wmic process where "caption = 'java.exe' and commandline like '%server-1.properties%'" get processid
ProcessId
6016
> taskkill /pid 6016 /f

One of the followers has become the new leader, and broker 1 is no longer in the in-sync replica set:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: my-replicated-topic    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2,0

However, no messages have been lost; they are still available for consumption:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

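Step 7: Use Kafka Connect to import/export data

Writing data from the console and reading it back to the console is a convenient start, but you will probably want to import data from other sources or export it to other systems. For many systems, you can use Kafka Connect instead of writing custom integration code.

Kafka Connect is a tool for importing and exporting data. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quick start we run Kafka Connect with simple connectors that import data from a file into a Kafka topic and export data from a Kafka topic to a file.

First, create some seed data to test with (a few sample messages):

echo -e "foo\nbar" > test.txt

On Windows:

> echo foo> test.txt
> echo bar>> test.txt

Next, start two connectors in standalone mode, which means they run in a single, local, dedicated process. We pass three configuration files as parameters. The first is the configuration for the Kafka Connect process itself and contains common settings, such as the Kafka brokers to connect to and the serialization format for data. Each remaining file specifies a connector to create, including a unique connector name, the connector class to instantiate, and any other configuration the connector needs.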

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

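These sample configuration files ship with Kafka. They use the local cluster configuration we started earlier and create two connectors. The first is a source connector that reads lines from an input file and publishes each one to a Kafka topic. The second is a sink connector that reads messages from a Kafka topic and writes each one as a line in an output file.

During startup you will see a number of log messages, including some showing the connectors being instantiated. Once the Kafka Connect process has started, the source connector should begin reading lines from test.txt and producing them to the topic connect-test. The sink connector should begin reading messages from the topic connect-test and writing them to the file test.sink.txt. You can verify that the data made it through the whole pipeline by checking the contents of the output file: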

more test.sink.txt
 foo
 bar

Note that the imported data is also stored in the Kafka topic connect-test, so you can view that topic with a console consumer:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning

{"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}

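The connectors keep processing data, so you can append data to the file and watch it move through the pipeline:

echo "Another line" >> test.txt

You should see the line appear in the console consumer output and in the sink file.

Step 8: Use Kafka Streams to process data

Kafka Streams is Kafka's client library for real-time stream processing and analysis of data stored in Kafka brokers. This quick start example shows how to run a streaming application. Here is the core of the WordCountDemo example, written with Java 8 lambda expressions for readability: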

KTable<String, Long> wordCounts = textLines
    // Split each text line into words on non-word characters (regex \W+).
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))

    // Ensure the words are available as record keys for the next aggregate operation.
    .map((key, value) -> new KeyValue<>(value, value))

    // Count the occurrences of each word (record key) and store the results into a table named "Counts".
    // countByKey is the 0.10.0-era API; newer releases use groupByKey().count() instead.
    .countByKey("Counts");

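It implements the WordCount algorithm, which computes how many times each word occurs in the input text. Unlike other WordCount examples you may have seen, which operate on bounded data, this demo behaves slightly differently because it is designed to operate on an infinite, unbounded stream of data. Like the bounded variant, it is a stateful algorithm that tracks and updates word counts. However, it must assume the input is potentially unbounded and can never know when it has processed "all" of it. It therefore periodically outputs its current state and results while it continues to process more data.

Now prepare input data for a Kafka topic; the Kafka Streams application will then process that topic.

> echo -e "all streams lead to kafka\nhello kafka streams\njoin kafka summit" > file-input.txt

Next, create the input topic streams-file-input and send the input data to it with the console producer. (In practice, stream data would probably flow into Kafka continuously while the application is up and running.)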

> bin/kafka-topics.sh --create \
            --zookeeper localhost:2181 \
            --replication-factor 1 \
            --partitions 1 \
            --topic streams-file-input

> cat file-input.txt | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-file-input

Now run the WordCount demo to process the input data:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

Apart from log entries, nothing is written to STDOUT, because the results are continuously written back to another topic, streams-wordcount-output. The demo runs for a few seconds and then, unlike a typical stream processing application, terminates on its own.

Now inspect the WordCountDemo output by reading from its output topic:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
            --topic streams-wordcount-output \
            --from-beginning \
            --formatter kafka.tools.DefaultMessageFormatter \
            --property print.key=true \
            --property print.value=true \
            --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
            --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

The following output is printed to the console (press Ctrl-C to stop):

all     1
streams 1
lead    1
to      1
kafka   1
hello   1
kafka   2
streams 2
join    1
kafka   3
summit  1
^C

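The first column is the message key and the second column is the message value. Note that the output is actually a continuous stream of updates. Each record (each line of the output above) is the latest count for a single word, that is, for a record key such as "kafka". When there are several records with the same key, each one updates the previous one.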
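The update-stream behaviour of the output above can be reproduced without Kafka. The sketch below is a plain-Java model under a stated assumption: it emits one update per word occurrence, which is the uncached case. Newer Kafka Streams versions cache state-store updates and may emit fewer intermediate lines. WordCountChangelog is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the WordCount changelog; no Kafka involved.
class WordCountChangelog {
    // Processes lines one at a time and emits "word count" after every word,
    // mirroring an update record each time a key's count changes.
    static List<String> updates(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split can yield a leading empty token
                }
                long updated = counts.merge(word, 1L, Long::sum);
                out.add(word + " " + updated);
            }
        }
        return out;
    }
}
```

Calling updates on the three input lines yields the same eleven (word, count) pairs, in the same order, as the console output above. The last record for each key, for example "kafka 3", is the current value in the table.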

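The Kafka Ecosystem

Many external tools integrate with Kafka, including stream processing systems, Hadoop integration, and monitoring and deployment tools.

Kafka APIs

Apache Kafka introduced a new Java client (in the org.apache.kafka.clients package) to replace the old Scala clients; for compatibility, the two will coexist for some time. To minimize dependencies, the new clients ship in a separate jar, while the old Scala clients remain packaged with the server.

Kafka has four core APIs:

The Producer API allows applications to send streams of data to topics in the Kafka cluster.
The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
The Streams API allows transforming streams of data from input topics to output topics.
The Connect API allows implementing connectors that continually pull data from some source system or application into Kafka, or push data from Kafka into some sink system or application.

Kafka exposes all its functionality over a language-independent protocol. Only the Java client is maintained as part of the Kafka project; the others are available as independent open-source projects. A list of non-Java clients is available here:
https://cwiki.apache.org/confluence/display/KAFKA/Clients

The Kafka producer client publishes records (messages) to the Kafka cluster.

The new producer is thread safe. Sharing a single producer instance across threads is generally faster than using multiple instances.

Here is a simple example that uses the producer to send records whose keys and values are sequential numbers as strings. It can be run as-is from a Java main method: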

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));

// close() flushes buffered records and releases the producer's resources
producer.close();

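The producer has a pool of buffer space that holds records not yet sent to the server, plus a background I/O thread that turns these records into requests and sends them to the cluster. Failing to close the producer after use leaks these resources.

The send() method is asynchronous. It adds the record to a buffer of pending sends and returns immediately. This lets the producer batch individual records together for efficiency.

The acks setting controls the criteria under which a request is considered complete, that is, whether a send counts as successful. The "all" setting we chose makes the send wait until the record is fully committed. This is the slowest but most durable setting.

retries: if a request fails, the producer can retry automatically. We set it to 0; enabling retries opens up the possibility of duplicate messages.

The producer keeps a buffer of unsent records for each partition. The size of these buffers is set by batch.size. A larger value produces larger batches but needs more memory, because there is generally one buffer per active partition.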
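The effect of batch.size on a single partition's buffer can be sketched as a simple grouping rule. This is a simplified, hypothetical model: record sizes are plain integers, and headers, compression, and linger.ms timing are ignored. Consecutive records are grouped until adding the next one would exceed the batch size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of batch.size for ONE partition's buffer.
class BatchAccumulator {
    static List<List<Integer>> batches(List<Integer> recordSizes, int batchSizeBytes) {
        List<List<Integer>> out = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int bytes = 0;
        for (int size : recordSizes) {
            // Close the current batch if this record would not fit into it.
            if (!current.isEmpty() && bytes + size > batchSizeBytes) {
                out.add(current);
                current = new ArrayList<>();
                bytes = 0;
            }
            // In this model a record larger than batchSizeBytes still gets a batch of its own.
            current.add(size);
            bytes += size;
        }
        if (!current.isEmpty()) {
            out.add(current); // the last, possibly partial batch
        }
        return out;
    }
}
```

With a 250-byte batch size, records of 100, 100, 100 and 400 bytes form three batches: [100, 100], [100] and [400]. A larger batch.size means fewer, bigger requests, at the cost of more buffer memory per active partition.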

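By default, a buffer can be sent immediately even if it still has unused space. To reduce the number of requests, set linger.ms to a value greater than 0. The producer then waits up to that many milliseconds before sending a request, in the hope that more records arrive to fill the same batch. This is analogous to Nagle's algorithm in TCP. In the snippet above, for example, all 100 records would likely go out in a single request because linger.ms is 1 millisecond. If the buffer is not full, however, this setting adds up to 1 millisecond of latency while the request waits for more records. Under heavy load, records that arrive close together are generally batched anyway, even with linger.ms=0. When the producer is not under heavy load, a value greater than 0 trades a small amount of latency for fewer, more efficient requests.

buffer.memory controls the total amount of memory available to the producer for buffering. If records are sent faster than they can be transmitted to the server, this buffer space is exhausted. Once it is exhausted, further send calls block. The maximum blocking time is set by max.block.ms, after which a TimeoutException is thrown.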
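The blocking behaviour of buffer.memory and max.block.ms can be modeled with a counting semaphore. This is an analogy, not Kafka's BufferPool, and BoundedSendBuffer is a hypothetical name. Also note an assumption: the sketch throws java.util.concurrent.TimeoutException, whereas the Kafka producer throws org.apache.kafka.common.errors.TimeoutException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical analogue of the producer's bounded buffer (buffer.memory + max.block.ms).
class BoundedSendBuffer {
    private final Semaphore freeBytes;
    private final long maxBlockMs;

    BoundedSendBuffer(int bufferMemoryBytes, long maxBlockMs) {
        this.freeBytes = new Semaphore(bufferMemoryBytes);
        this.maxBlockMs = maxBlockMs;
    }

    // Called by send(): reserve space, waiting at most maxBlockMs for it to become free.
    void reserve(int recordBytes) throws InterruptedException, TimeoutException {
        if (!freeBytes.tryAcquire(recordBytes, maxBlockMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("Failed to allocate " + recordBytes
                    + " bytes within " + maxBlockMs + " ms");
        }
    }

    // Called by the background I/O thread once the record's batch has been sent.
    void release(int recordBytes) {
        freeBytes.release(recordBytes);
    }

    int available() {
        return freeBytes.availablePermits();
    }
}
```

If the I/O thread frees space (release) before the timeout expires, the blocked send proceeds. Otherwise the caller gets a timeout, just as described for max.block.ms.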

key.serializer and value.serializer specify how to turn the key and value objects in the user's ProducerRecord into bytes. You can use the included ByteArraySerializer or StringSerializer for simple byte-array or string types.

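send()

public Future<RecordMetadata> send(ProducerRecord<K,V> record, Callback callback)

Asynchronously sends a record to a topic and invokes the provided callback when the send has been acknowledged.

The send is asynchronous: the method returns as soon as the record has been stored in the buffer of records waiting to be sent. This lets you send many records in parallel without blocking on the response to each one.

The result of the send is a RecordMetadata that specifies the partition the record was sent to, the offset it was assigned, and the record's timestamp. If the topic uses CreateTime, the timestamp is the one the user provided, or the send time if the user did not specify one. If the topic uses LogAppendTime, the timestamp is the broker's local time when the message is appended.

Because the send call is asynchronous, it returns a Future for the RecordMetadata that will be assigned to the record. Calling get() on this future blocks until the associated request completes, and then returns the record's metadata or throws any exception that occurred while sending.

To simulate a simple blocking call, call get() immediately: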

byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();

For fully non-blocking usage, pass a Callback, which is invoked when the request completes:

ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("the-topic", key, value);
producer.send(record,
              new Callback() {
                  public void onCompletion(RecordMetadata metadata, Exception e) {
                      if (e != null)
                          e.printStackTrace();
                      else
                          System.out.println("The offset of the record we just sent is: " + metadata.offset());
                  }
              });

Callbacks for records sent to the same partition are guaranteed to execute in order. In the following example, callback1 is guaranteed to execute before callback2:

producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key1, value1), callback1);
producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key2, value2), callback2);

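Note: callbacks generally execute in the producer's I/O thread, so they should be fast; otherwise they delay the sending of messages from other threads. If you need blocking or computationally expensive callbacks, use your own Executor in the callback body to parallelize the work.

Specified by:

send in interface Producer<K,V>

Parameters:

record - The record (message) to send
callback - A user-supplied callback to execute when the server has acknowledged the record (null indicates no callback)

Throws:

InterruptException - If the thread is interrupted while blocked
SerializationException - If the key or value are not valid objects given the configured serializers
TimeoutException - If the time taken to fetch metadata or allocate memory for the record exceeds max.block.ms
KafkaException - If a Kafka-related error occurs that does not belong to the public API exceptions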

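Kafka Consumer API

As of the 0.9.0 release, we have added a new Java consumer to replace the existing ZooKeeper-based high-level and low-level consumers. This client is still considered beta quality. To ensure a smooth upgrade for users, we still maintain the old 0.8 consumer clients (both the high-level and the low-level consumer APIs), and they continue to work on a 0.9 cluster.

The new consumer API removes the distinction between the 0.8 high-level and low-level consumers. You can add it to your client with the following Maven dependency: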

<dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>0.10.1.0</version>
</dependency>

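The following sections describe how to use the new consumer.

Kafka Consumer Client

A client that consumes records from a Kafka cluster. It transparently handles failed brokers and adapts as topic partitions migrate within the cluster. It also interacts with the broker so that groups of consumers can balance consumption among themselves.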

public class KafkaConsumer<K,V>
extends Object
implements Consumer<K,V>

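The consumer keeps long-lived TCP connections to the brokers it fetches from. Failing to close the consumer after use leaks these connections. The consumer is not thread-safe; see Multi-threaded Processing below for details.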

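Cross-Version Compatibility

This client can communicate with brokers running version 0.10.0 or newer. Older brokers may not support some features. For example, 0.10.0 brokers do not support offsetsForTimes, because that feature was added in version 0.10.1. Calling an API that the running broker version does not support results in an UnsupportedVersionException.

Offsets and Consumer Position

Kafka keeps a numerical offset for every record in a partition. The offset uniquely identifies a record within that partition and also denotes the consumer's position in the partition. For example, a consumer at position 5 has consumed the records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of "position" relevant to a consumer:

The consumer's position is the offset of the next record it will be given. It is one greater than the highest offset the consumer has seen in that partition. It advances automatically every time the consumer receives messages from a call to poll(long).

The committed position is the last offset that has been stored securely. If the process fails and restarts, the consumer recovers to this offset. The consumer can either commit offsets automatically and periodically, or control the committed position manually by calling a commit API (such as commitSync and commitAsync).

This distinction lets the consumer control when a record is considered consumed. We discuss it in more detail below.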
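The difference between the two positions can be shown with a small model of one partition. This is a hypothetical sketch, not the KafkaConsumer API, and the log is a plain list. poll advances the position, commit copies the position into the committed offset (which is therefore "last processed + 1"), and a restart resumes from the committed offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two per-partition positions a Kafka consumer tracks.
class PartitionCursor {
    private long position;   // offset of the next record poll() will return
    private long committed;  // last securely stored offset; a restart resumes here

    PartitionCursor(long committed) {
        this.committed = committed;
        this.position = committed;
    }

    // poll(): hand out up to maxRecords records and advance the position past them.
    List<String> poll(List<String> log, int maxRecords) {
        int from = (int) Math.min(position, log.size());
        int to = (int) Math.min(log.size(), position + maxRecords);
        position = to;
        return new ArrayList<>(log.subList(from, to));
    }

    // commit: the committed offset is the NEXT offset to read, i.e. last processed + 1.
    void commit() {
        committed = position;
    }

    // A restarted consumer resumes from the committed offset, not from its old position.
    void restart() {
        position = committed;
    }

    long position() { return position; }
    long committed() { return committed; }
}
```

Here is an example run. After polling r0 and r1 and committing, the committed offset is 2. Polling two more records moves the position to 4. A restart then rewinds to 2, so r2 and r3 are delivered again, because they were never committed.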

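Consumer Groups and Topic Subscriptions

Kafka uses consumer groups to let a pool of processes divide up the work of consuming and processing messages. These processes can run on the same machine or be spread across many machines for scalability and fault tolerance. All consumers with the same group.id belong to the same consumer group.

Each consumer in a group dynamically subscribes to a list of topics through the subscribe API. Kafka delivers each message in the subscribed topics to one consumer in each consumer group. It does this by balancing the partitions across all members of the group, so each partition is assigned to exactly one consumer in the group. So if a topic has four partitions and a consumer group has only two consumers, each consumer consumes two partitions.

Group membership is maintained dynamically. If a consumer fails, the partitions assigned to it are reassigned to other consumers in the same group. Likewise, if a new consumer joins the group, some partitions are moved from existing consumers to it. This is called rebalancing the group and is discussed in more detail below. A rebalance also happens when new partitions are added to a subscribed topic, or when a new topic matching a subscribed regular expression is created. The group discovers new partitions through periodic metadata refreshes and assigns them to its members.

Conceptually, you can think of a consumer group as a single logical subscriber made up of multiple processes. As a multi-subscriber system, Kafka supports any number of consumer groups for a given topic without duplicating data.

This is a slight generalization of functionality that is common in messaging systems. To get queue-like semantics, as in a traditional messaging system, put all processes in a single consumer group; message delivery is then balanced over the group, just like a queue. Unlike a traditional messaging system, though, you can have several such groups. To get publish-subscribe semantics, give each process its own consumer group, so each process subscribes to all messages published to the topic.

In addition, when a group rebalance happens, consumers can be notified through a ConsumerRebalanceListener. This lets them run any necessary application-level logic, such as cleaning up state or committing offsets manually. See "Storing offsets outside Kafka" below for details.

A consumer can also manually assign specific partitions with assign(Collection). When partitions are assigned manually, dynamic partition assignment and consumer group coordination are disabled.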

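Detecting Consumer Failures

After subscribing to a set of topics, the consumer joins the group automatically when poll(long) is called. As long as you keep calling poll, the consumer stays in the group and keeps receiving messages from its assigned partitions. In addition, the consumer sends periodic heartbeats to the server. If the consumer crashes or cannot send a heartbeat within session.timeout.ms, it is considered dead and its partitions are reassigned.

A consumer can also end up in a "livelock": it keeps sending heartbeats but makes no progress on processing. To prevent such a consumer from holding on to its partitions indefinitely, Kafka provides a liveness check based on max.poll.interval.ms. If you do not call poll at least as often as this maximum interval, the client proactively leaves the group so that another consumer can take over its partitions. When this happens, you may see an offset commit failure (a CommitFailedException thrown from commitSync()). This safety mechanism guarantees that only active members of the group can commit offsets. To stay in the group, you must keep calling poll.

The consumer provides two settings that control the poll loop:

max.poll.interval.ms: Increasing the allowed interval between polls gives the consumer more time to handle the batch of records returned by poll(long). The drawback is that a larger value can delay a group rebalance.
max.poll.records: This setting limits the number of records returned by a single call to poll, which makes it easier to predict the maximum amount of work per poll interval. Tuning this value can shorten the poll interval and reduce the impact of group rebalancing.

When message processing time varies unpredictably, these options may not be enough. The recommended approach is to move message processing to another thread so that the consumer can keep calling poll. You must then take care that committed offsets never get ahead of the actual position. Typically you disable auto-commit and manually commit a record's offset only after the processing thread has finished with it (how you do this is up to you). You also need to pause the partition so that poll returns no new messages until the thread has finished the previously returned ones. Otherwise, if your processing is slower than fetching, handing records to ever more threads can exhaust your machine's memory.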

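Examples

The consumer API is flexible enough to cover many consumption scenarios. The following examples show how to use it.

Automatic Offset Committing

Here is a simple Kafka consumer that relies on automatic offset committing.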

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

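Setting enable.auto.commit means offsets are committed automatically, at a frequency controlled by auto.commit.interval.ms.

The connection to the cluster is bootstrapped by specifying one or more brokers in bootstrap.servers. You do not need to list every broker, because the client discovers the rest of the cluster automatically. It is still wise to list more than one, in case a server is down.

In this example, the client subscribes to the topics foo and bar as part of a consumer group called test.

The broker uses a heartbeat mechanism to detect failed processes in the test group automatically. The consumer periodically pings the cluster to tell it that it is still alive. As long as it keeps doing so, it is considered alive and keeps the right to consume the partitions assigned to it. If it stops heartbeating for longer than session.timeout.ms, it is considered failed and its partitions are assigned to another process.

The deserializer settings specify how to turn bytes into objects. By specifying string deserializers, we say that the keys and values of our messages are simple strings.

Manual Offset Control

Instead of committing offsets periodically, you can control the commits yourself and commit a message's offset only once you consider the message consumed. This is useful when consuming a message involves some processing logic: the message should not count as consumed until all of that processing has finished.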

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "false");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        insertIntoDb(buffer);  // application-specific database write, not part of Kafka
        consumer.commitSync(); // commit only after the batch is safely in the database
        buffer.clear();
    }
}


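In this example, we consume a batch of messages and accumulate them in memory. Once enough messages have accumulated, we insert them into a database in one batch. If offsets were committed automatically, as in the previous example, the messages would count as consumed as soon as poll returned them. That is a problem: the process could fail after batching the records but before inserting them into the database.

To avoid this, we commit the offsets manually, only after the corresponding records have been inserted into the database. This gives us precise control over when a message counts as successfully consumed. It also raises the opposite possibility: the process could fail after the insert but before the commit (a window of perhaps only a few milliseconds, but still possible). In that case, the process that takes over resumes from the last committed offset and repeats the insert of the last batch. This is the "at-least-once" guarantee: in failure cases, messages may be delivered more than once.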
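The duplicate described above can be reproduced with a small simulation. This hypothetical sketch uses no Kafka classes: the "database" is a list, offsets are list indices, and a simulated crash happens once, between the insert and the commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "insert, then commit" with a crash in between.
class AtLeastOnceDemo {
    static List<String> run(List<String> log, int batchSize, boolean crashBeforeFirstCommit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> database = new ArrayList<>(); // stands in for insertIntoDb's target
        long committed = 0;                         // committed offset, i.e. next record to read
        boolean crashPending = crashBeforeFirstCommit;
        while (committed < log.size()) {
            int end = (int) Math.min(log.size(), committed + batchSize);
            database.addAll(log.subList((int) committed, end)); // 1. insert the batch
            if (crashPending) {
                crashPending = false; // 2. crash: the commit never happens, and the
                continue;             //    restarted consumer resumes at the old offset
            }
            committed = end;          // 3. commit after the insert succeeded
        }
        return database;
    }
}
```

With the log [a, b, c], a batch size of 2, and a crash, the database ends up with [a, b, a, b, c]: nothing is lost, but one batch is duplicated. Without the crash, it contains [a, b, c].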

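Note: automatic offset commits can also give you "at-least-once" delivery. The requirement is that you must process all data returned by each call to poll(long) before the next call, or before closing the consumer. If you fail to do so, the committed offset can get ahead of the consumed position, and messages are lost. The advantage of manual offset control is that you decide directly when a message counts as "consumed".

The example above uses commitSync to mark all received messages as committed. In some cases you may want finer control and commit an explicit offset instead. In the example below, we commit the offset after we finish processing the messages in each partition.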

try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}


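Note: the committed offset should always be the offset of the next message your application will read. So when you call commitSync(offsets), add one to the offset of the last message you processed.

Subscribing to Specific Partitions

In the previous examples, we subscribed to the topics we were interested in and let Kafka divide their partitions fairly among us. In some cases, however, you may need to control which partitions are assigned, for example:

If the consumer process maintains some local state tied to a partition (such as a key-value store on local disk), it should only receive messages for that partition.
If the consumer process is itself highly available and is restarted automatically when it fails (perhaps by a cluster management framework such as YARN or Mesos, by AWS facilities, or as part of a stream processing framework). In that case Kafka does not need to detect the failure and reassign partitions, because the consumer process will be restarted on another machine.

To use this mode, simply call assign(Collection) with the partitions you want to consume: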

     String topic = "foo";
     TopicPartition partition0 = new TopicPartition(topic, 0);
     TopicPartition partition1 = new TopicPartition(topic, 1);
     consumer.assign(Arrays.asList(partition0, partition1));

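Once the partitions are assigned, you can call poll in a loop, just as in the previous examples. The consumer's group is still used for committing offsets, but the set of partitions now changes only through another call to assign. Manual assignment does not use group coordination, so a consumer failure does not trigger a partition rebalance. Each consumer works independently, even if it shares a groupId with other consumers. To avoid offset commit conflicts, you should usually make sure each consumer instance has a unique groupId.

Note that manual partition assignment (assign) cannot be mixed with dynamic partition assignment through topic subscription (subscribe).

Storing Offsets Outside Kafka

A consumer does not have to use Kafka's built-in offset storage; it can store offsets wherever it chooses. The main use case is storing both the consumption results and the offsets in the same system, so that both are written atomically. This is not always possible. When it is, consumption becomes fully atomic and gives "exactly once" semantics, which are stronger than the default "at-least-once" semantics of Kafka's offset commit functionality.

Here are a couple of examples:

If the consumption results are stored in a relational database, storing the offsets in the same database lets you commit the results and the offset in a single transaction. Either the transaction succeeds and the offset is updated, or the results are not stored and the offset is not updated either.
If the results are stored in a local store, the offsets may be stored there as well. For example, you could build a search index by subscribing to a particular partition and storing the offsets together with the indexed data. If this is done atomically, then even if a crash loses unsynced data, whatever remains still has its matching offset. The indexing process simply resumes from where its surviving data ends, so no updates are lost; it only re-reads the most recent messages.

Each message carries its own offset, so to manage your own offsets you only need to do the following:

Configure enable.auto.commit=false.
Use the offset provided with each ConsumerRecord to save your position.
On restart, restore the consumer's position with seek(TopicPartition, long).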
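The three steps above can be sketched with a hypothetical local store in which one synchronized method stands in for a database transaction. None of these names are Kafka APIs. The result and the next offset are always written together, and a restarted consumer starts from the stored offset, which is what a real consumer would pass to seek(TopicPartition, long).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical local store that keeps results and the next offset together,
// so a single atomic update writes both or neither.
class ResultStore {
    private final List<String> results = new ArrayList<>();
    private long nextOffset = 0;

    // Stand-in for one database transaction: result and offset change together.
    synchronized void saveAtomically(String result, long recordOffset) {
        results.add(result);
        nextOffset = recordOffset + 1;
    }

    synchronized long nextOffset() { return nextOffset; }

    synchronized List<String> results() { return new ArrayList<>(results); }
}

class ExternalOffsetConsumer {
    // Processes records from a partition starting at the stored offset, the way a
    // real consumer would seek(partition, store.nextOffset()) after a restart.
    static void consume(List<String> partition, ResultStore store, int maxRecords) {
        long start = store.nextOffset();
        long end = Math.min(partition.size(), start + maxRecords);
        for (long offset = start; offset < end; offset++) {
            String result = partition.get((int) offset).toUpperCase(Locale.ROOT);
            store.saveAtomically(result, offset);
        }
    }
}
```

Suppose the first run processes two records and then stops. A second run starts at offset 2, so each record's result is stored exactly once.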

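This kind of usage is simplest when partition assignment is also manual, as in the search index case above. If partitions are assigned automatically, you must take special care with assignment changes. You do this by passing a ConsumerRebalanceListener to subscribe(Collection, ConsumerRebalanceListener) or subscribe(Pattern, ConsumerRebalanceListener). For example, when partitions are taken away from a consumer, the consumer commits its offsets for them in ConsumerRebalanceListener.onPartitionsRevoked(Collection). When partitions are assigned to a consumer, it looks up their offsets and initializes itself to the correct position in ConsumerRebalanceListener.onPartitionsAssigned(Collection).

Another common use of ConsumerRebalanceListener is flushing any caches the application keeps for partitions that have moved elsewhere.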

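Controlling the Consumer's Position

Most of the time a consumer simply consumes messages from beginning to end, periodically committing its position (automatically or manually). However, Kafka also lets a consumer control its position manually: it can re-consume older messages or skip ahead past recent ones.

Manually controlling the consumer's position is useful in several situations.

One is time-sensitive processing: a consumer that has fallen far enough behind can skip ahead and start consuming the most recent messages.

Another is a system with local state, as described in the previous section. In such a system, the consumer initializes its position on startup to whatever the local store contains. Likewise, if the local state is destroyed (say, because a disk is lost), it can be rebuilt on a new machine by re-consuming all the data and recreating the state, assuming Kafka retains enough history.

Use seek(TopicPartition, long) to set a new position. There are also special methods for seeking to the earliest and latest offsets retained by the server: seekToBeginning(Collection) and seekToEnd(Collection).

Consumption Flow Control

When a consumer is assigned multiple partitions, it consumes from all of them at the same time, giving them equal priority. In some cases, however, a consumer needs to focus on some of its partitions first, and only move on to the others when those have little or no data left.

Stream processing is one example. A processor fetches from two topics and joins the two streams. When one topic falls far behind the other, the processor pauses the topic that is ahead so the lagging one can catch up.

Kafka supports dynamic flow control with pause(Collection) and resume(Collection). They pause consumption of the specified assigned partitions, or resume consumption of the specified paused partitions, in future calls to poll(long).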

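Multi-threaded Processing

The Kafka consumer is NOT thread-safe. All network I/O happens in the thread of the application making the call. It is the user's responsibility to synchronize multi-threaded access correctly; unsynchronized access results in a ConcurrentModificationException.

The only exception to this rule is wakeup(), which can safely be called from an external thread to interrupt an active operation. The thread blocked in that operation then receives a WakeupException. This can be used to shut down the consumer from another thread. The following snippet shows the typical pattern: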

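import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class KafkaConsumerRunner implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final KafkaConsumer<String, String> consumer;

    public KafkaConsumerRunner(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    public void run() {
        try {
            consumer.subscribe(Arrays.asList("topic"));
            while (!closed.get()) {
                ConsumerRecords<String, String> records = consumer.poll(10000);
                // Handle new records
            }
        } catch (WakeupException e) {
            // Ignore exception if closing
            if (!closed.get()) throw e;
        } finally {
            consumer.close();
        }
    }

    // Shutdown hook which can be called from a separate thread
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }
}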


Then, from a separate thread, the consumer can be shut down by setting the closed flag and waking it up:

closed.set(true);
consumer.wakeup();

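We have intentionally not implemented a particular threading model for processing. That leaves several options for processing messages with multiple threads.

One Consumer Per Thread

Give each thread its own consumer instance. Here are the pros and cons of this approach:
- PRO: It is the easiest to implement.
- PRO: It is often the fastest, because no coordination between threads is needed.
- PRO: It makes in-order processing per partition easy (each thread simply handles the messages it receives, in order).
- CON: More consumers means more TCP connections to the cluster (one per thread). Kafka generally handles connections very efficiently, so this is a small cost.
- CON: More consumers means more requests sent to the server and slightly less batching of data, which can reduce I/O throughput somewhat.
- CON: The total number of threads across all processes is limited by the total number of partitions.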
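
Decouple Consumption and Processing

Another option is to have one or more consumer threads that do all the data consumption. They hand ConsumerRecords instances off to a blocking queue, which a pool of processor threads consumes to do the actual record processing. This option also has pros and cons:
- PRO: The number of consumers and processors can be scaled independently. A single consumer can feed many processor threads, which avoids any limit imposed by the number of partitions.
- CON: Guaranteeing order across processors requires special care. The threads run independently, so a later message may be processed before an earlier one simply because of thread scheduling. If ordering does not matter, this is not a problem.
- CON: Manually committing positions becomes harder, because all threads must coordinate to make sure processing for a partition is complete.

There are many variations on this approach. For example, each processor thread can have its own queue, and the consumer threads can hash each record's TopicPartition to pick a queue. This guarantees in-order consumption per partition and also simplifies commits.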
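The per-queue variation just described can be sketched with standard Java executors. This is a hypothetical illustration. Each single-thread executor acts as one processor thread with its own FIFO queue. The topic name and partition number are hashed to pick a queue; this stands in for hashing a real TopicPartition object, since the sketch avoids Kafka classes. Records from one partition therefore run in order, while different partitions can run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical dispatcher: the consumer thread hands each record to the worker chosen
// by hashing its partition, so records of one partition are processed in order.
class PartitionOrderedDispatcher implements AutoCloseable {
    private final List<ExecutorService> workers = new ArrayList<>();

    PartitionOrderedDispatcher(int numWorkers) {
        for (int i = 0; i < numWorkers; i++) {
            // One thread per queue: tasks submitted to it run strictly in FIFO order.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    void dispatch(String topic, int partition, String record, Consumer<String> handler) {
        // Same (topic, partition) -> same worker -> same processing order as submission.
        int index = Math.floorMod((topic + "-" + partition).hashCode(), workers.size());
        workers.get(index).submit(() -> handler.accept(record));
    }

    @Override
    public void close() throws InterruptedException {
        for (ExecutorService w : workers) {
            w.shutdown(); // stop accepting new work; queued tasks still run
        }
        for (ExecutorService w : workers) {
            w.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
}
```

Committing stays simpler than with a shared pool: a partition's offset can be committed once its single worker has finished everything submitted for it. The cost is that one slow partition can delay any other partition that hashes to the same worker.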

Kafka Streams API

2.3 Streams API

Release 0.10.0 added a new client library, Kafka Streams, which is still of alpha quality. You can add it to your project with Maven:

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>0.10.0.0</version>
    </dependency>

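(Note: classes annotated with @InterfaceStability.Unstable are public APIs that may change in future releases, with no guarantee of backward compatibility.)

KafkaStreams client (0.10.1.1 API)

A Kafka client that performs continuous computation on input from one or more input topics and sends output to zero or more output topics.

The computation logic can be defined as a DAG topology of processors with the TopologyBuilder class, or with the KStreamBuilder class, which provides the high-level KStream DSL for defining transformations. (In other words, the computation logic is your own processing code.)

The KafkaStreams class manages the lifecycle of a Kafka Streams instance. A single instance can run one or more processing threads, as specified in the configuration.

A KafkaStreams instance can coordinate with any other instances that have the same application ID (in the same process, in other processes on the same machine, or on remote machines), acting together as a single, possibly distributed, stream processing client. The instances divide the work based on the input topic partitions, so that all partitions are consumed. If instances are added or fail, all instances rebalance the partition assignment among themselves to keep the load balanced.

Internally, a KafkaStreams instance contains a normal KafkaProducer and KafkaConsumer, used for writing and reading.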

一个简单的例子：

    Map<String, Object> props = new HashMap<>();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-processing-application");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    StreamsConfig config = new StreamsConfig(props);

    KStreamBuilder builder = new KStreamBuilder();
    // The type witness makes value a String. String.length() returns a primitive int,
    // which has no toString() method, so String.valueOf converts the length instead.
    builder.<String, String>stream("my-input-topic")
           .mapValues(value -> String.valueOf(value.length()))
           .to("my-output-topic");

    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();

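Kafka Connect API

The Connect API lets you implement connectors that continually pull data from some source system into Kafka, or push data from Kafka into some sink system.

Most Connect users do not need to use this API directly; they can use pre-built connectors without writing any code. More information is available in the Kafka Connect documentation.

If you want to implement a custom connector, see the Javadoc.

Spring Boot and Kafka Integration

1. Preface

For Spring projects that use Apache Kafka, Spring provides core integration for Kafka messaging. It provides a "template" as a high-level abstraction for sending messages, and support for message-driven POJOs.

3. Introduction

This section gives a quick-start example that you can run directly.

3.1. Quick Tour for the Impatient

This is a five-minute tour of Spring Kafka.

Prerequisites: Apache Kafka must be installed and running. You also need the spring-kafka JAR and all of its dependencies. The easiest way to get them is to declare a dependency in your build tool. The following example shows how to do this with Maven: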

<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
  <version>2.4.1.RELEASE</version>
</dependency>

With Gradle:

compile 'org.springframework.kafka:spring-kafka:2.4.1.RELEASE'

When using Spring Boot, you can omit the version, and Spring Boot automatically brings in the correct version for your Boot version:

<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
</dependency>

With Gradle:

compile 'org.springframework.kafka:spring-kafka'

3.1.1. Compatibility

This quick tour applies to the following versions:

Apache Kafka Clients 2.2.0
Spring Framework 5.2.x
Minimum Java version: 8

3.1.2. A Very, Very Quick Example

As the following example shows, you can send and receive messages with plain Java:

package com.example.kafka;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.listener.KafkaMessageListenerContainer;
import org.springframework.kafka.listener.MessageListener;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import static org.junit.jupiter.api.Assertions.assertTrue;

@SpringBootTest
class KafkaTests01 {

    private Logger logger = LoggerFactory.getLogger(getClass());
    private String group = "group01";
    private String topic1 = "topic1";

    @Test
    public void testAutoCommit() throws Exception {
        logger.info("Start auto");

        // Start the consumer: a listener container for topic1 and topic2
        ContainerProperties containerProps = new ContainerProperties("topic1", "topic2");
        final CountDownLatch latch = new CountDownLatch(4);
        containerProps.setMessageListener(new MessageListener<Integer, String>() {
            @Override
            public void onMessage(ConsumerRecord<Integer, String> message) {
                logger.info("received: " + message);
                latch.countDown();
            }
        });
        KafkaMessageListenerContainer<Integer, String> container = createContainer(containerProps);
        container.setBeanName("testAuto");
        container.start();  // start the consumer

        Thread.sleep(1000); // wait a bit for the container to start

        // Create the producer (KafkaTemplate) and send four messages to topic1
        KafkaTemplate<Integer, String> template = createTemplate();
        template.setDefaultTopic(topic1);
        template.sendDefault(0, "foo");
        template.sendDefault(2, "bar");
        template.sendDefault(0, "baz");
        template.sendDefault(2, "qux");
        template.flush();

        assertTrue(latch.await(60, TimeUnit.SECONDS));
        container.stop(); // stop the consumer
        logger.info("Stop auto");
    }

    private KafkaMessageListenerContainer<Integer, String> createContainer(ContainerProperties containerProps) {
        Map<String, Object> props = consumerProps();
        DefaultKafkaConsumerFactory<Integer, String> cf = new DefaultKafkaConsumerFactory<Integer, String>(props);
        KafkaMessageListenerContainer<Integer, String> container = new KafkaMessageListenerContainer<>(cf, containerProps);
        return container;
    }

    private KafkaTemplate<Integer, String> createTemplate() {
        Map<String, Object> senderProps = senderProps();
        ProducerFactory<Integer, String> pf = new DefaultKafkaProducerFactory<Integer, String>(senderProps);
        KafkaTemplate<Integer, String> template = new KafkaTemplate<>(pf);
        return template;
    }

    private Map<String, Object> consumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, group);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "100");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "15000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        return props;
    }

    private Map<String, Object> senderProps() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.RETRIES_CONFIG, 0);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 1);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        return props;
    }
}


3.1.3. With Java Configuration

You can do the same work as in the previous example with Spring configuration in Java. The following example shows how to do so:

@Autowired
private Listener listener;

@Autowired
private KafkaTemplate<Integer, String> template;

@Test
public void testSimple() throws Exception {
    template.send("annotated1", 0, "foo");
    template.flush();
    assertTrue(this.listener.latch1.await(10, TimeUnit.SECONDS));
}

@Configuration
@EnableKafka
public class Config {

    @Bean
    ConcurrentKafkaListenerContainerFactory<Integer, String>
                        kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<Integer, String> factory =
                                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }

    @Bean
    public ConsumerFactory<Integer, String> consumerFactory() {
        return new DefaultKafkaConsumerFactory<>(consumerConfigs());
    }

    @Bean
    public Map<String, Object> consumerConfigs() {
        Map<String, Object> props = new HashMap<>();
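        // embeddedKafka: assumed to be an EmbeddedKafkaBroker (spring-kafka-test) from the test setup
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, embeddedKafka.getBrokersAsString());
        // ... remaining consumer properties omitted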
        return props;
    }

    @Bean
    public Listener listener() {
        return new Listener();
    }

    @Bean
    public ProducerFactory<Integer, String> producerFactory() {
        return new DefaultKafkaProducerFactory<>(producerConfigs());
    }

    @Bean
    public Map<String, Object> producerConfigs() {
        Map<String, Object> props = new HashMap<>();
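        // embeddedKafka: assumed to be an EmbeddedKafkaBroker (spring-kafka-test) from the test setup
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, embeddedKafka.getBrokersAsString());
        // ... remaining producer properties omitted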
        return props;
    }

    @Bean
    public KafkaTemplate<Integer, String> kafkaTemplate() {
        return new KafkaTemplate<Integer, String>(producerFactory());
    }

}
public class Listener {

    // Package-private (not private) so that the test can await it via this.listener.latch1
    final CountDownLatch latch1 = new CountDownLatch(1);

    @KafkaListener(id = "foo", topics = "annotated1")
    public void listen1(String foo) {
        this.latch1.countDown();
    }
}


3.1.4. Even Quicker, with Spring Boot

Spring Boot can make things even simpler. The following Spring Boot application sends three messages to a topic, receives them, and stops:

package com.example.kafka.demo03;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

@SpringBootApplication
public class Application implements CommandLineRunner {

    public static Logger logger = LoggerFactory.getLogger(Application.class);

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args).close();
    }

    @Autowired
    private KafkaTemplate<String, String> template;

    private final CountDownLatch latch = new CountDownLatch(3);

    @Override
    public void run(String... args) throws Exception {
        this.template.send("myTopic", "foo1");
        this.template.send("myTopic", "foo2");
        this.template.send("myTopic", "foo3");
        if (latch.await(60, TimeUnit.SECONDS)) {
            logger.info("All received");
        } else {
            logger.warn("Timed out before all messages were received");
        }
    }

    @KafkaListener(topics = "myTopic")
    public void listen(ConsumerRecord<?, ?> cr) throws Exception {
        logger.info(cr.toString());
        latch.countDown();
    }
}


Configure application.properties:

spring.kafka.consumer.group-id=foo
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.listener.missing-topics-fatal=false

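spring.kafka.consumer.group-id sets the consumer group id.
spring.kafka.consumer.auto-offset-reset=earliest makes a new consumer group receive the messages we sent before it started reading; this is for testing convenience (in production, configure latest to receive only new messages).
spring.kafka.listener.missing-topics-fatal=false keeps the application from failing when a listened topic does not exist.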
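The effect of `auto-offset-reset` boils down to a small decision: a valid committed offset for the group always wins, and the reset policy applies only when there is none (a new group) or it no longer exists in the log. The plain-Java sketch below models that decision for a single partition described by its log-start and log-end offsets; the class `OffsetResetSketch` and its method are hypothetical, not a Kafka or Spring API.

```java
import java.util.OptionalLong;

// Hypothetical model of how a consumer chooses its starting offset for one partition.
// It only mirrors the documented auto.offset.reset behavior.
final class OffsetResetSketch {

    static long startingOffset(OptionalLong committed, long logStartOffset,
                               long logEndOffset, String autoOffsetReset) {
        // A committed offset that still lies inside the retained log is always used
        if (committed.isPresent()
                && committed.getAsLong() >= logStartOffset
                && committed.getAsLong() <= logEndOffset) {
            return committed.getAsLong();
        }
        // No usable committed offset (new group, or data already deleted): apply the policy
        switch (autoOffsetReset) {
            case "earliest":
                return logStartOffset;   // re-read every message still retained
            case "latest":
                return logEndOffset;     // only messages produced from now on
            default:                     // "none" (or an invalid value) is an error
                throw new IllegalStateException(
                        "no valid committed offset and auto.offset.reset=" + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // A brand-new group on a partition that already holds offsets 0..2
        System.out.println(startingOffset(OptionalLong.empty(), 0, 3, "earliest")); // 0
        System.out.println(startingOffset(OptionalLong.empty(), 0, 3, "latest"));   // 3
    }
}
```

With `latest`, a group that has never committed would start at offset 3 and miss the three messages already in the partition, which is the situation the `earliest` setting above avoids in the demo.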

Kafka Broker configuration (version 0.10)

3.1 Broker configuration

The essential configurations are the following:

broker.id
log.dirs
zookeeper.connect

Topic-level configurations and defaults are discussed in more detail below.

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
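zookeeper.connect	ZooKeeper host string	string			high
advertised.host.name	DEPRECATED: only used when `advertised.listeners` or `listeners` are not set. Use `advertised.listeners` instead. Hostname to publish to ZooKeeper for clients to use. In IaaS environments, this may need to be different from the interface to which the broker binds. If this is not set, it will use the value for `host.name` if configured. Otherwise it will use the value returned from `java.net.InetAddress.getCanonicalHostName()`.	string	null		high
advertised.listeners	Listeners to publish to ZooKeeper for clients to use, if different from `listeners`. In IaaS environments, this may need to be different from the interface to which the broker binds. If this is not set, the value for `listeners` will be used.	string	null		high
advertised.port	DEPRECATED: only used when `advertised.listeners` or `listeners` are not set. Use `advertised.listeners` instead. The port to publish to ZooKeeper for clients to use. In IaaS environments, this may need to be different from the port to which the broker binds. If this is not set, it will publish the same port that the broker binds to.	int	null		high
auto.create.topics.enable	Enable auto creation of topics on the server	boolean	true		high
auto.leader.rebalance.enable	Enables auto leader balancing. A background thread checks and triggers leader balance if required at regular intervals.	boolean	true		high
background.threads	The number of threads to use for various background processing tasks	int	10	[1,...]	high
broker.id	The broker id for this server. If unset, a unique broker id will be generated. To avoid conflicts between ZooKeeper-generated broker ids and user-configured broker ids, generated broker ids start from reserved.broker.max.id + 1.	int	-1		high
compression.type	Specify the final compression type for a given topic. This configuration accepts the standard compression codecs ('gzip', 'snappy', 'lz4'). It additionally accepts 'uncompressed', which is equivalent to no compression, and 'producer', which means retain the original compression codec set by the producer.	string	producer		high
delete.topic.enable	Enables topic deletion. Deleting a topic through the admin tool will have no effect if this config is turned off	boolean	false		high
host.name	DEPRECATED: only used when `listeners` is not set. Use `listeners` instead. Hostname of the broker. If this is set, it will only bind to this address. If this is not set, it will bind to all interfaces	string	""		high
leader.imbalance.check.interval.seconds	The frequency with which the partition rebalance check is triggered by the controller	long	300		high
leader.imbalance.per.broker.percentage	The ratio of leader imbalance allowed per broker. The controller will trigger a leader balance if it goes above this value per broker. The value is specified in percentage.	int	10		high
listeners	Listener list: a comma-separated list of URIs we will listen on and their protocols. Specify hostname as 0.0.0.0 to bind to all interfaces; leave hostname empty to bind to the default interface. Examples of legal listener lists: PLAINTEXT://myhost:9092,TRACE://:9091 and PLAINTEXT://0.0.0.0:9092,TRACE://localhost:9093	string	null		high
log.dir	The directory in which the log data is kept (supplemental for the log.dirs property)	string	/tmp/kafka-logs		high
log.dirs	The directories in which the log data is kept. If not set, the value in log.dir is used	string	null		high
log.flush.interval.messages	The number of messages accumulated on a log partition before messages are flushed to disk	long	9223372036854775807	[1,...]	high
log.flush.interval.ms	The maximum time in ms that a message in any topic is kept in memory before being flushed to disk. If not set, the value in log.flush.scheduler.interval.ms is used	long	null		high
log.flush.offset.checkpoint.interval.ms	The frequency with which we update the persistent record of the last flush, which acts as the log recovery point	int	60000	[0,...]	high
log.flush.scheduler.interval.ms	The frequency in ms with which the log flusher checks whether any log needs to be flushed to disk	long	9223372036854775807		high
log.retention.bytes	The maximum size of the log before deleting it	long	-1		high
log.retention.hours	The number of hours to keep a log file before deleting it (in hours); tertiary to the log.retention.ms property	int	168		high
log.retention.minutes	The number of minutes to keep a log file before deleting it (in minutes); secondary to the log.retention.ms property. If not set, the value in log.retention.hours is used	int	null		high
log.retention.ms	The number of milliseconds to keep a log file before deleting it (in milliseconds). If not set, the value in log.retention.minutes is used	long	null		high
log.roll.hours	The maximum time before a new log segment is rolled out (in hours); secondary to the log.roll.ms property	int	168	[1,...]	high
log.roll.jitter.hours	The maximum jitter to subtract from logRollTimeMillis (in hours); secondary to the log.roll.jitter.ms property	int	0	[0,...]	high
log.roll.ms	The maximum time before a new log segment is rolled out (in milliseconds). If not set, the value in log.roll.hours is used	long	null		high
log.segment.bytes	The maximum size of a single log file	int	1073741824	[14,...]	high
log.segment.delete.delay.ms	The amount of time to wait before deleting a file from the filesystem	long	60000	[0,...]	high
message.max.bytes	The maximum size of a message that the server can receive	int	1000012	[0,...]	high
min.insync.replicas	When a producer sets acks to "all" (or "-1"), min.insync.replicas specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met, the producer raises an exception (either NotEnoughReplicas or NotEnoughReplicasAfterAppend). Used together, min.insync.replicas and acks allow you to enforce greater durability guarantees. A typical scenario is to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with acks of "all". This ensures that the producer raises an exception if a majority of replicas do not receive a write.	int	1	[1,...]	high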
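num.io.threads	The number of I/O threads that the server uses for carrying out network requests	int	8	[1,...]	high
num.network.threads	The number of network threads that the server uses for handling network requests	int	3	[1,...]	high
num.recovery.threads.per.data.dir	The number of threads per data directory to be used for log recovery at startup and flushing at shutdown	int	1	[1,...]	high
num.replica.fetchers	Number of fetcher threads used to replicate messages from a source broker. Increasing this value can increase the degree of I/O parallelism in the follower broker.	int	1		high
offset.metadata.max.bytes	The maximum size for a metadata entry associated with an offset commit	int	4096		high
offsets.commit.required.acks	The required acks before the commit can be accepted. In general, the default (-1) should not be overridden	short	-1		high
offsets.commit.timeout.ms	Offset commit will be delayed until all replicas for the offsets topic receive the commit or this timeout is reached. This is similar to the producer request timeout.	int	5000	[1,...]	high
offsets.load.buffer.size	Batch size for reading from the offsets segments when loading offsets into the cache	int	5242880	[1,...]	high
offsets.retention.check.interval.ms	Frequency at which to check for stale offsets	long	600000	[1,...]	high
offsets.retention.minutes	Log retention window in minutes for the offsets topic	int	1440	[1,...]	high
offsets.topic.compression.codec	Compression codec for the offsets topic; compression may be used to achieve "atomic" commits	int	0		high
offsets.topic.num.partitions	The number of partitions for the offset commit topic (should not change after deployment)	int	50	[1,...]	high
offsets.topic.replication.factor	The replication factor for the offsets topic (set it higher to ensure availability). To ensure that the effective replication factor of the offsets topic is the configured value, the number of alive brokers has to be at least the replication factor at the time of the first request for the offsets topic. If not, either the offsets topic creation will fail or it will get a replication factor of min(alive brokers, configured replication factor)	short	3	[1,...]	high
offsets.topic.segment.bytes	The offsets topic segment bytes should be kept relatively small in order to facilitate faster log compaction and cache loads	int	104857600	[1,...]	high
port	DEPRECATED: only used when `listeners` is not set. Use `listeners` instead. The port to listen and accept connections on	int	9092		high
queued.max.requests	The number of queued requests allowed before blocking the network threads	int	500	[1,...]	high
quota.consumer.default	DEPRECATED: used only when dynamic default quotas are not configured in ZooKeeper. Any consumer distinguished by clientId/consumer group will be throttled if it fetches more bytes than this value per second	long	9223372036854775807	[1,...]	high
quota.producer.default	DEPRECATED: used only when dynamic default quotas are not configured in ZooKeeper. Any producer distinguished by clientId will be throttled if it produces more bytes than this value per second	long	9223372036854775807	[1,...]	high
replica.fetch.min.bytes	Minimum bytes expected for each fetch response. If not enough bytes, wait up to replicaMaxWaitTimeMs	int	1		high
replica.fetch.wait.max.ms	Max wait time for each fetch request issued by follower replicas. This value should always be less than replica.lag.time.max.ms to prevent frequent shrinking of the ISR for low-throughput topics	int	500		high
replica.high.watermark.checkpoint.interval.ms	The frequency with which the high watermark is saved out to disk	long	5000		high
replica.lag.time.max.ms	If a follower hasn't sent any fetch requests, or hasn't consumed up to the leader's log end offset, for at least this time, the leader removes the follower from the ISR	long	10000		high
replica.socket.receive.buffer.bytes	The socket receive buffer for network requests	int	65536		high
replica.socket.timeout.ms	The socket timeout for network requests. Its value should be at least replica.fetch.wait.max.ms	int	30000		high
request.timeout.ms	Controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary, or fail the request if retries are exhausted.	int	30000		high
socket.receive.buffer.bytes	The SO_RCVBUF buffer of the socket server sockets. If the value is -1, the OS default will be used.	int	102400		high
socket.request.max.bytes	The maximum number of bytes in a socket request	int	104857600	[1,...]	high
socket.send.buffer.bytes	The SO_SNDBUF buffer of the socket server sockets. If the value is -1, the OS default will be used.	int	102400		high
unclean.leader.election.enable	Indicates whether to enable replicas not in the ISR set to be elected as leader as a last resort, even though doing so may result in data loss	boolean	true		high
zookeeper.connection.timeout.ms	The max time that the client waits to establish a connection to ZooKeeper. If not set, the value in zookeeper.session.timeout.ms is used	int	null		high
zookeeper.session.timeout.ms	ZooKeeper session timeout	int	6000		high
zookeeper.set.acl	Set client to use secure ACLs	boolean	false		high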
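broker.id.generation.enable	Enable automatic broker id generation on the server. When enabled, the value configured for reserved.broker.max.id should be reviewed.	boolean	true		medium
broker.rack	Rack of the broker. This is used in rack-aware replica assignment for fault tolerance. Examples: `RACK1`, `us-east-1d`	string	null		medium
connections.max.idle.ms	Idle connections timeout: the server socket processor threads close connections that have been idle longer than this	long	600000		medium
controlled.shutdown.enable	Enable controlled shutdown of the server	boolean	true		medium
controlled.shutdown.max.retries	Controlled shutdown can fail for multiple reasons. This determines the number of retries when such a failure happens	int	3		medium
controlled.shutdown.retry.backoff.ms	Before each retry, the system needs time to recover from the state that caused the previous failure (controller failover, replica lag, etc.). This config determines the amount of time to wait before retrying.	long	5000		medium
controller.socket.timeout.ms	The socket timeout for controller-to-broker channels	int	30000		medium
default.replication.factor	Default replication factor for automatically created topics	int	1		medium
fetch.purgatory.purge.interval.requests	The purge interval (in number of requests) of the fetch request purgatory	int	1000		medium
group.max.session.timeout.ms	The maximum allowed session timeout for registered consumers. Longer timeouts give consumers more time to process messages between heartbeats, at the cost of a longer time to detect failures.	int	300000		medium
group.min.session.timeout.ms	The minimum allowed session timeout for registered consumers. Shorter timeouts result in quicker failure detection at the cost of more frequent consumer heartbeating, which can overwhelm broker resources.	int	6000		medium
inter.broker.protocol.version	Specify which version of the inter-broker protocol will be used. This is typically bumped after all brokers have been upgraded to a new version. Examples of valid values: 0.8.0, 0.8.1, 0.8.1.1, 0.8.2, 0.8.2.0, 0.8.2.1, 0.9.0.0, 0.9.0.1. Check ApiVersion for the full list.	string	0.10.1-IV2		medium
log.cleaner.backoff.ms	The amount of time to sleep when there are no logs to clean	long	15000	[0,...]	medium
log.cleaner.dedupe.buffer.size	The total memory used for log deduplication across all cleaner threads	long	134217728		medium
log.cleaner.delete.retention.ms	How long delete records (tombstones) are retained	long	86400000		medium
log.cleaner.enable	Enable the log cleaner process to run on the server. Should be enabled if using any topics with cleanup.policy=compact, including the internal offsets topic. If disabled, those topics will not be compacted and will continually grow in size.	boolean	true		medium
log.cleaner.io.buffer.load.factor	Log cleaner dedupe buffer load factor: the percentage full the dedupe buffer can become. A higher value allows more log to be cleaned at once but leads to more hash collisions	double	0.9		medium
log.cleaner.io.buffer.size	The total memory used for log cleaner I/O buffers across all cleaner threads	int	524288	[0,...]	medium
log.cleaner.io.max.bytes.per.second	The log cleaner is throttled so that the sum of its read and write I/O is less than this value on average	double	1.7976931348623157E308		medium
log.cleaner.min.cleanable.ratio	The minimum ratio of dirty log to total log for a log to be eligible for cleaning	double	0.5		medium
log.cleaner.min.compaction.lag.ms	The minimum time a message will remain uncompacted in the log. Only applicable for logs that are being compacted.	long	0		medium
log.cleaner.threads	The number of background threads to use for log cleaning	int	1	[0,...]	medium
log.cleanup.policy	The default cleanup policy for segments beyond the retention window. A comma-separated list of valid policies. Valid policies are "delete" and "compact"	list	[delete]	[compact, delete]	medium
log.index.interval.bytes	The interval with which we add an entry to the offset index	int	4096	[0,...]	medium
log.index.size.max.bytes	The maximum size in bytes of the offset index	int	10485760	[4,...]	medium
log.message.format.version	Specify the message format version the broker will use to append messages to the logs, for example 0.8.2, 0.9.0.0, 0.10.0. By setting a particular message format version, the user is certifying that all the existing messages on disk are smaller than or equal to the specified version. Setting this value incorrectly will cause consumers with older versions to break, as they will receive messages in a format that they don't understand.	string	0.10.1-IV2		medium
log.message.timestamp.difference.max.ms	The maximum difference allowed between the timestamp when a broker receives a message and the timestamp specified in the message. If log.message.timestamp.type=CreateTime, a message will be rejected if the difference in timestamp exceeds this threshold. This configuration is ignored if log.message.timestamp.type=LogAppendTime.	long	9223372036854775807	[0,...]	medium
log.message.timestamp.type	Define whether the timestamp in the message is the message create time or the log append time. The value should be either `CreateTime` or `LogAppendTime`	string	CreateTime	[CreateTime, LogAppendTime]	medium
log.preallocate	Whether to preallocate the file when creating a new segment. If you are using Kafka on Windows, you probably need to set it to true.	boolean	false		medium
log.retention.check.interval.ms	The frequency in milliseconds with which the log cleaner checks whether any log is eligible for deletion	long	300000	[1,...]	medium
max.connections.per.ip	The maximum number of connections allowed from each IP address	int	2147483647	[1,...]	medium
max.connections.per.ip.overrides	Per-IP or hostname overrides of the default maximum number of connections	string	""		medium
num.partitions	The default number of log partitions per topic	int	1	[1,...]	medium
principal.builder.class	The fully qualified name of a class that implements the PrincipalBuilder interface, which is currently used to build the Principal for connections with the SSL SecurityProtocol.	class	class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder		medium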
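producer.purgatory.purge.interval.requests	The purge interval (in number of requests) of the producer request purgatory	int	1000		medium
replica.fetch.backoff.ms	The amount of time to sleep when a fetch partition error occurs	int	1000	[0,...]	medium
replica.fetch.max.bytes	The number of bytes of messages to attempt to fetch for each partition. This is not an absolute maximum: if the first message in the first non-empty partition of the fetch is larger than this value, the message will still be returned to ensure that progress can be made. The maximum message size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config).	int	1048576	[0,...]	medium
replica.fetch.response.max.bytes	Maximum bytes expected for the entire fetch response. This is not an absolute maximum: if the first message in the first non-empty partition of the fetch is larger than this value, the message will still be returned to ensure that progress can be made. The maximum message size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config).	int	10485760	[0,...]	medium
reserved.broker.max.id	Max number that can be used for a broker.id	int	1000	[0,...]	medium
sasl.enabled.mechanisms	The list of SASL mechanisms enabled in the Kafka server. The list may contain any mechanism for which a security provider is available. Only GSSAPI is enabled by default.	list	[GSSAPI]		medium
sasl.kerberos.kinit.cmd	Kerberos kinit command path	string	/usr/bin/kinit		medium
sasl.kerberos.min.time.before.relogin	Login thread sleep time between refresh attempts	long	60000		medium
sasl.kerberos.principal.to.local.rules	A list of rules for mapping principal names to short names (typically operating system usernames). The rules are evaluated in order, and the first rule that matches a principal name is used to map it to a short name; any later rules in the list are ignored. By default, principal names of the form {username}/{hostname}@{REALM} are mapped to {username}.	list	[DEFAULT]		medium
sasl.kerberos.service.name	The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.	string	null		medium
sasl.kerberos.ticket.renew.jitter	Percentage of random jitter added to the renewal time	double	0.05		medium
sasl.kerberos.ticket.renew.window.factor	The login thread sleeps until the specified window factor of time from the last refresh to the ticket's expiry has been reached, at which time it tries to renew the ticket	double	0.8		medium
sasl.mechanism.inter.broker.protocol	SASL mechanism used for inter-broker communication. Default is GSSAPI.	string	GSSAPI		medium
security.inter.broker.protocol	Security protocol used to communicate between brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
ssl.cipher.suites	A list of cipher suites. A cipher suite is a named combination of authentication, encryption, MAC and key exchange algorithms used to negotiate the security settings for a network connection using the TLS or SSL network protocol. By default, all available cipher suites are supported.	list	null		medium
ssl.client.auth	Configures the Kafka broker to request client authentication. Common settings: `ssl.client.auth=required`: client authentication is required. `ssl.client.auth=requested`: client authentication is optional; unlike required, the client can choose not to provide authentication information about itself. `ssl.client.auth=none`: client authentication is not needed.	string	none	[required, requested, none]	medium
ssl.enabled.protocols	The list of protocols enabled for SSL connections	list	[TLSv1.2, TLSv1.1, TLSv1]		medium
ssl.key.password	The password of the private key in the key store file. This is optional for clients.	password	null		medium
ssl.keymanager.algorithm	The algorithm used by the key manager factory for SSL connections. The default value is the key manager factory algorithm configured for the Java Virtual Machine.	string	SunX509		medium
ssl.keystore.location	The location of the key store file. This is optional for clients and can be used for two-way authentication of the client.	string	null		medium
ssl.keystore.password	The store password for the key store file. This is optional for clients and only needed if ssl.keystore.location is configured.	password	null		medium
ssl.keystore.type	The file format of the key store file. This is optional for clients.	string	JKS		medium
ssl.protocol	The SSL protocol used to generate the SSLContext. The default, TLS, is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their use is discouraged due to known security vulnerabilities.	string	TLS		medium
ssl.provider	The name of the security provider used for SSL connections. The default value is the default security provider of the JVM.	string	null		medium
ssl.trustmanager.algorithm	The algorithm used by the trust manager factory for SSL connections. The default value is the trust manager factory algorithm configured for the Java Virtual Machine.	string	PKIX		medium
ssl.truststore.location	The location of the trust store file	string	null		medium
ssl.truststore.password	The password for the trust store file	password	null		medium
ssl.truststore.type	The file format of the trust store file	string	JKS		medium
authorizer.class.name	The authorizer class that should be used for authorization	string	""		low
metric.reporters	A list of classes to use as metrics reporters. Implementing the `MetricReporter` interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.	list	[]		low
metrics.num.samples	The number of samples maintained to compute metrics	int	2	[1,...]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over	long	30000	[1,...]	low
quota.window.num	The number of samples to retain in memory for client quotas	int	11	[1,...]	low
quota.window.size.seconds	The time span of each sample for client quotas	int	1	[1,...]	low
replication.quota.window.num	The number of samples to retain in memory for replication quotas	int	11	[1,...]	low
replication.quota.window.size.seconds	The time span of each sample for replication quotas	int	1	[1,...]	low
ssl.endpoint.identification.algorithm	The endpoint identification algorithm used to validate the server hostname using the server certificate	string	null		low
ssl.secure.random.implementation	The SecureRandom PRNG implementation to use for SSL cryptography operations	string	null		low
zookeeper.sync.time.ms	How far a ZooKeeper follower can be behind the ZooKeeper leader	int	2000		low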
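The three `log.retention.*` rows form a precedence chain: `log.retention.ms` wins, then `log.retention.minutes`, then `log.retention.hours`. A minimal plain-Java sketch of that resolution follows, assuming `null` means "not set"; the class `RetentionSketch` is hypothetical and does not model special values such as `-1`.

```java
// Hypothetical helper: resolves the broker-wide retention time from the three
// log.retention.* properties (null = not set), following the precedence in the table:
// log.retention.ms > log.retention.minutes > log.retention.hours.
final class RetentionSketch {

    static long effectiveRetentionMs(Long retentionMs, Integer retentionMinutes, int retentionHours) {
        if (retentionMs != null) {
            return retentionMs;                    // most specific setting wins
        }
        if (retentionMinutes != null) {
            return retentionMinutes * 60_000L;     // minutes -> ms
        }
        return retentionHours * 3_600_000L;        // hours -> ms (default 168 h)
    }

    public static void main(String[] args) {
        System.out.println(effectiveRetentionMs(null, null, 168));       // 604800000 (7 days)
        System.out.println(effectiveRetentionMs(null, 30, 168));         // 1800000
        System.out.println(effectiveRetentionMs(3_600_000L, 30, 168));   // 3600000
    }
}
```

The default of 168 hours works out to 604800000 ms, the same value listed as the default of the topic-level `retention.ms`.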

More details about broker configuration can be found in the Scala class kafka.server.KafkaConfig.
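The `auto.leader.rebalance.enable`, `leader.imbalance.check.interval.seconds` and `leader.imbalance.per.broker.percentage` rows work together: at each check, the controller looks, per broker, at the partitions whose preferred replica (the first replica in the assignment) is that broker, and triggers a preferred-leader election when too many of them are led elsewhere. The plain-Java sketch below reproduces only that ratio test as we understand it; the class `LeaderImbalanceSketch` and its inputs are hypothetical, and the real controller logic has additional conditions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the per-broker leader imbalance check that the controller runs
// every leader.imbalance.check.interval.seconds when auto.leader.rebalance.enable=true.
final class LeaderImbalanceSketch {

    // preferredLeader: partition -> first replica in its assignment (the "preferred" leader)
    // currentLeader:   partition -> broker that currently leads the partition
    static boolean needsRebalance(int brokerId, Map<String, Integer> preferredLeader,
                                  Map<String, Integer> currentLeader, int imbalancePercentage) {
        int preferredCount = 0;   // partitions that prefer this broker as leader
        int notLedCount = 0;      // ...and are currently led by some other broker
        for (Map.Entry<String, Integer> e : preferredLeader.entrySet()) {
            if (e.getValue() == brokerId) {
                preferredCount++;
                Integer leader = currentLeader.get(e.getKey());
                if (leader == null || leader != brokerId) {
                    notLedCount++;
                }
            }
        }
        if (preferredCount == 0) {
            return false;
        }
        double imbalanceRatio = (double) notLedCount / preferredCount;
        return imbalanceRatio > imbalancePercentage / 100.0;
    }

    public static void main(String[] args) {
        Map<String, Integer> preferred = new HashMap<>();
        Map<String, Integer> current = new HashMap<>();
        String[] partitions = {"t-0", "t-1", "t-2", "t-3"};
        int[] preferredIds = {1, 1, 1, 2};
        int[] currentIds = {1, 2, 1, 2};   // broker 1 lost leadership of t-1 (e.g. after a restart)
        for (int i = 0; i < partitions.length; i++) {
            preferred.put(partitions[i], preferredIds[i]);
            current.put(partitions[i], currentIds[i]);
        }
        // Broker 1: 1 of 3 preferred partitions led elsewhere = 33% > 10% -> rebalance
        System.out.println(needsRebalance(1, preferred, current, 10)); // true
        System.out.println(needsRebalance(2, preferred, current, 10)); // false
    }
}
```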

For the previous version, see: Kafka Broker configuration (0.8.2)

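Kafka Topic configuration

3.2 Topic configuration

Configurations pertinent to topics have both a server default and an optional per-topic override. If no per-topic configuration is given, the server default is used. The override can be set at topic creation time by giving one or more --config options. This example creates a topic named my-topic with a custom max message size and flush rate: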

> bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 \
    --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1

Overrides can also be changed or set later using the alter configs command. This example updates the max message size for my-topic:

> bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic \
    --alter --add-config max.message.bytes=128000

You can verify the result with:

> bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --describe

To remove the override:

> bin/kafka-configs.sh --zookeeper localhost:2181  --entity-type topics --entity-name my-topic --alter --delete-config max.message.bytes
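The override mechanism used by these commands can be pictured as a two-level lookup: the topic's own value if one was set with `--config` or `--add-config`, otherwise the broker property named in the SERVER DEFAULT PROPERTY column of the topic table. The plain-Java sketch below assumes that simple model; the class `TopicConfigSketch` is hypothetical and its mapping covers only three rows of that table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-topic override shadows the broker-wide default.
final class TopicConfigSketch {

    // Topic-level name -> broker-level "server default property" (three rows of the table)
    static final Map<String, String> SERVER_DEFAULT_PROPERTY = new HashMap<>();
    static {
        SERVER_DEFAULT_PROPERTY.put("max.message.bytes", "message.max.bytes");
        SERVER_DEFAULT_PROPERTY.put("flush.messages", "log.flush.interval.messages");
        SERVER_DEFAULT_PROPERTY.put("retention.ms", "log.retention.ms");
    }

    static String effectiveValue(String topicKey, Map<String, String> topicOverrides,
                                 Map<String, String> brokerConfig) {
        String override = topicOverrides.get(topicKey);
        if (override != null) {
            return override;                      // set with --config or --add-config
        }
        String brokerKey = SERVER_DEFAULT_PROPERTY.get(topicKey);
        return brokerKey == null ? null : brokerConfig.get(brokerKey);  // server default
    }

    public static void main(String[] args) {
        Map<String, String> broker = new HashMap<>();
        broker.put("message.max.bytes", "1000012");
        Map<String, String> myTopic = new HashMap<>();
        myTopic.put("max.message.bytes", "128000");                     // after --add-config

        System.out.println(effectiveValue("max.message.bytes", myTopic, broker)); // 128000
        myTopic.remove("max.message.bytes");                             // after --delete-config
        System.out.println(effectiveValue("max.message.bytes", myTopic, broker)); // 1000012
    }
}
```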

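The following are the topic-level configurations. The SERVER DEFAULT PROPERTY column names the broker property that supplies the default value; a server default only applies to a topic that has no explicit override for that config.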

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	SERVER DEFAULT PROPERTY	IMPORTANCE
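cleanup.policy	Either "delete" or "compact". This designates the retention policy to use on old log segments. The default policy ("delete") discards old segments when their retention time or size limit has been reached. The "compact" setting enables log compaction on the topic.	list	delete	[compact, delete]	log.cleanup.policy	medium
compression.type	Specify the final compression type for a given topic. Accepts the standard compression codecs ('gzip', 'snappy', 'lz4'). It additionally accepts 'uncompressed', which is equivalent to no compression, and 'producer', which means retain the original compression codec set by the producer.	string	producer	[uncompressed, snappy, lz4, gzip, producer]	compression.type	medium
delete.retention.ms	The amount of time to retain delete tombstone markers for log-compacted topics. This setting also gives a bound on the time in which a consumer must complete a read if it begins from offset 0, to ensure that it gets a valid snapshot of the final stage (otherwise delete tombstones may be collected before it completes its scan).	long	86400000	[0,...]	log.cleaner.delete.retention.ms	medium
file.delete.delay.ms	The time to wait before deleting a file from the filesystem	long	60000	[0,...]	log.segment.delete.delay.ms	medium
flush.messages	This setting allows specifying an interval at which we will force an fsync of data written to the log. For example, if this were set to 1, we would fsync after every message; if it were 5, we would fsync after every five messages. In general we recommend you not set this, and instead use replication for durability and allow the operating system's background flush capabilities, as they are more efficient. This setting can be overridden on a per-topic basis (see the per-topic configuration section).	long	9223372036854775807	[0,...]	log.flush.interval.messages	medium
flush.ms	This setting allows specifying a time interval at which we will force an fsync of data written to the log. For example, if this were set to 1000, we would fsync after 1000 ms had passed. In general we recommend you not set this, and instead use replication for durability and allow the operating system's background flush capabilities, as they are more efficient.	long	9223372036854775807	[0,...]	log.flush.interval.ms	medium
follower.replication.throttled.replicas	A list of replicas for which log replication should be throttled on the follower side. The list should describe a set of replicas in the form [PartitionId]:[BrokerId],[PartitionId]:[BrokerId]:..., or alternatively the wildcard '*' can be used to throttle all replicas for this topic.	list	""	[partitionId],[brokerId]:[partitionId],[brokerId]:...	follower.replication.throttled.replicas	medium
index.interval.bytes	This setting controls how frequently Kafka adds an index entry to its offset index. The default setting ensures that we index a message roughly every 4096 bytes. More indexing allows reads to jump closer to the exact position in the log but makes the index larger. You probably don't need to change this.	int	4096	[0,...]	log.index.interval.bytes	medium
leader.replication.throttled.replicas	A list of replicas for which log replication should be throttled on the leader side. The list should describe a set of replicas in the form [PartitionId]:[BrokerId],[PartitionId]:[BrokerId]:..., or alternatively the wildcard '*' can be used to throttle all replicas for this topic.	list	""	[partitionId],[brokerId]:[partitionId],[brokerId]:...	leader.replication.throttled.replicas	medium
max.message.bytes	The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large. In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches, and this limit only applies to a single record in that case.	int	1000012	[0,...]	message.max.bytes	medium
message.format.version	Specify the message format version the broker will use to append messages to the logs. The value should be a valid ApiVersion, for example 0.8.2, 0.9.0.0, 0.10.0; check ApiVersion for more details. By setting a particular message format version, the user is certifying that all the existing messages on disk are smaller than or equal to the specified version. Setting this value incorrectly will cause consumers with older versions to break, as they will receive messages in a format that they don't understand.	string	0.11.0-IV2		log.message.format.version	medium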
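min.cleanable.dirty.ratio	This configuration controls how frequently the log compactor will attempt to clean the log (assuming log compaction is enabled). By default we will avoid cleaning a log where more than 50% of the log has been compacted. This ratio bounds the maximum space wasted in the log by duplicates (at 50%, at most 50% of the log could be duplicates). A higher ratio means fewer, more efficient cleanings but more wasted space in the log.	double	0.5	[0,...,1]	log.cleaner.min.cleanable.ratio	medium
min.compaction.lag.ms	The minimum time a message will remain uncompacted in the log. Only applicable for logs that are being compacted.	long	0	[0,...]	log.cleaner.min.compaction.lag.ms	medium
min.insync.replicas	When a producer sets acks to "all" (or "-1"), this configuration specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met, the producer raises an exception (either NotEnoughReplicas or NotEnoughReplicasAfterAppend). Used together, min.insync.replicas and acks allow you to enforce greater durability guarantees. A typical scenario is to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with acks of "all". This ensures that the producer raises an exception if a majority of replicas do not receive a write.	int	1	[1,...]	min.insync.replicas	medium
preallocate	True if we should preallocate the file on disk when creating a new log segment	boolean	false		log.preallocate	medium
retention.bytes	If we are using the "delete" retention policy, this configuration controls the maximum size a partition's log can grow to before we discard old log segments to free up space. By default there is no size limit, only a time limit.	long	-1		log.retention.bytes	medium
retention.ms	If we are using the "delete" retention policy, this configuration controls the maximum time we will retain a log before we discard old log segments to free up space. This represents an SLA on how soon consumers must read their data.	long	604800000		log.retention.ms	medium
segment.bytes	This configuration controls the segment file size for the log. Retention and cleaning are always done one file at a time, so a larger segment size means fewer files but less granular control over retention.	int	1073741824	[14,...]	log.segment.bytes	medium
segment.index.bytes	This configuration controls the size of the index that maps offsets to file positions. We preallocate this index file and shrink it only after log rolls. You generally should not need to change this setting.	int	10485760	[0,...]	log.index.size.max.bytes	medium
segment.jitter.ms	The maximum random jitter subtracted from the scheduled segment roll time to avoid thundering herds of segment rolling	long	0	[0,...]	log.roll.jitter.ms	medium
segment.ms	This configuration controls the period of time after which Kafka will force the log to roll even if the segment file isn't full, to ensure that retention can delete or compact old data.	long	604800000	[0,...]	log.roll.ms	medium
unclean.leader.election.enable	Indicates whether to enable replicas not in the ISR set to be elected as leader as a last resort, even though doing so may result in data loss	boolean	false		unclean.leader.election.enable	medium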
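The `min.insync.replicas` row (in both the broker and the topic table) only matters for producers using `acks=all`. A minimal plain-Java sketch of that check for a single produce request follows, assuming the ISR size is known; the class `MinInsyncSketch` and its `Outcome` enum are hypothetical and ignore timeouts and retries.

```java
// Hypothetical model of the min.insync.replicas check for one produce request.
final class MinInsyncSketch {

    enum Outcome { ACKNOWLEDGED, NOT_ENOUGH_REPLICAS }

    // acks: "0", "1", "all" or "-1"; isrSize: in-sync replicas of the partition, leader included
    static Outcome produce(String acks, int isrSize, int minInsyncReplicas) {
        boolean waitForAll = "all".equals(acks) || "-1".equals(acks);
        if (waitForAll && isrSize < minInsyncReplicas) {
            // The producer gets NotEnoughReplicas (or NotEnoughReplicasAfterAppend
            // if the ISR shrinks after the leader has already appended the batch)
            return Outcome.NOT_ENOUGH_REPLICAS;
        }
        // acks=0 and acks=1 are not affected by min.insync.replicas
        return Outcome.ACKNOWLEDGED;
    }

    public static void main(String[] args) {
        // Replication factor 3, min.insync.replicas=2, as in the typical scenario in the table
        System.out.println(produce("all", 3, 2)); // ACKNOWLEDGED
        System.out.println(produce("all", 2, 2)); // ACKNOWLEDGED (one replica down)
        System.out.println(produce("all", 1, 2)); // NOT_ENOUGH_REPLICAS
        System.out.println(produce("1", 1, 2));   // ACKNOWLEDGED, but less durable
    }
}
```

With a replication factor of 3 and `min.insync.replicas=2`, the topic tolerates one replica failure without rejecting `acks=all` writes; a second failure makes such writes fail instead of silently reducing durability.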

Kafka Producer configuration

3.3 Producer configuration

Below is the configuration of the Java producer:

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
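bootstrap.servers	A list of host/port pairs to use for establishing the initial connection to the Kafka cluster, in the form host1:port1,host2:port2,.... There is no need to list every server in the cluster, because Kafka discovers the rest of the cluster from the servers given (you may want to list more than one, though, in case a server is down).	list			high
key.serializer	Serializer class for the key, implementing the Serializer interface	class			high
value.serializer	Serializer class for the value, implementing the Serializer interface	class			high
acks	The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed: acks=0: the producer will not wait for any acknowledgment from the server at all. The record is immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record, and the retries configuration will not take effect (as the client generally won't know of any failures). The offset returned for each record is always -1. acks=1: the leader writes the record to its local log and responds without awaiting full acknowledgment from all followers. If the leader fails immediately after acknowledging the record but before the followers have replicated it, the record is lost. acks=all: the leader waits for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee and is equivalent to acks=-1.	string	1	[all, -1, 0, 1]	high
buffer.memory	The total bytes of memory the producer can use to buffer records waiting to be sent to the server. If records are sent faster than they can be delivered to the server, the producer blocks for `max.block.ms`, after which it throws an exception. This setting should correspond roughly to the total memory the producer will use, but it is not a hard bound, since not all memory the producer uses is used for buffering: some additional memory is used for compression (if compression is enabled) and for maintaining in-flight requests.	long	33554432	[0,...]	high
compression.type	The compression type for all data generated by the producer. The default is none (i.e. no compression). Valid values are none, gzip, snappy, or lz4. Compression is applied to full batches of data, so the efficacy of batching also affects the compression ratio (more batching means better compression).	string	none		high
retries	Setting a value greater than zero causes the client to resend any record whose send fails. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 can change the ordering of records: if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, the records in the second batch may appear first.	int	0	[0,...,2147483647]	high
ssl.key.password	The password of the private key in the key store file. This is optional for clients.	password	null		high
ssl.keystore.location	The location of the key store file. This is optional for clients and can be used for two-way authentication of the client.	string	null		high
ssl.keystore.password	The store password for the key store file. Only needed if ssl.keystore.location is configured.	password	null		high
ssl.truststore.location	The location of the trust store file	string	null		high
ssl.truststore.password	The password for the trust store file	password	null		high
batch.size	The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size. Requests sent to brokers will contain multiple batches, one for each partition with data available to be sent. A small batch size makes batching less common and may reduce throughput (a batch size of zero disables batching entirely). A very large batch size may use memory somewhat more wastefully, as we always allocate a buffer of the specified batch size in anticipation of additional records.	int	16384	[0,...]	medium
client.id	An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port, by allowing a logical application name to be included in server-side request logging.	string	""		medium
connections.max.idle.ms	Close idle connections after the number of milliseconds specified by this config	long	540000		medium
linger.ms	The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load, when records arrive faster than they can be sent out. However, in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay: rather than immediately sending out a record, the producer waits for up to the given delay so that other records can be batched together with it. This is analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition, the batch is sent immediately regardless of this setting; however, if we have fewer than this many bytes accumulated for the partition, we "linger" for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would reduce the number of requests sent but would add up to 5 ms of latency to records sent in the absence of load.	long	0	[0,...]	medium
max.block.ms	Controls how long KafkaProducer.send() and KafkaProducer.partitionsFor() will block. These methods can be blocked either because the buffer is full or because metadata is unavailable. Blocking in user-supplied serializers or partitioners is not counted against this timeout.	long	60000	[0,...]	medium
max.request.size	The maximum size of a request in bytes. This setting limits the number of record batches the producer will send in a single request, to avoid sending huge requests. This is also effectively a cap on the maximum record batch size. Note that the server has its own cap on record batch size, which may be different from this.	int	1048576	[0,...]	medium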
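partitioner.class	Partitioner class that implements the Partitioner interface	class	org.apache.kafka.clients.producer.internals.DefaultPartitioner		medium
receive.buffer.bytes	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.	int	32768	[-1,...]	medium
request.timeout.ms	Controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary, or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.	int	30000	[0,...]	medium
sasl.jaas.config	JAAS login context parameters for SASL connections, in the format used by JAAS configuration files (the JAAS configuration file format is described in the Java documentation). The format for the value is: '<loginModuleClass> <controlFlag> (<optionName>=<optionValue>)*;'	password	null		medium
sasl.kerberos.service.name	The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.	string	null		medium
sasl.mechanism	SASL mechanism used for client connections. This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.	string	GSSAPI		medium
security.protocol	Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
send.buffer.bytes	The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used.	int	131072	[-1,...]	medium
ssl.enabled.protocols	The list of protocols enabled for SSL connections	list	TLSv1.2,TLSv1.1,TLSv1		medium
ssl.keystore.type	The file format of the key store file. This is optional for clients.	string	JKS		medium
ssl.protocol	The SSL protocol used to generate the SSLContext. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their use is discouraged due to known security vulnerabilities.	string	TLS		medium
ssl.provider	The name of the security provider used for SSL connections. The default value is the default security provider of the JVM.	string	null		medium
ssl.truststore.type	The file format of the trust store file	string	JKS		medium
enable.idempotence	When set to 'true', the producer will ensure that exactly one copy of each message is written in the stream. If 'false', producer retries due to broker failures, etc., may write duplicates of the retried message in the stream. This is set to 'false' by default. Note that enabling idempotence requires max.in.flight.requests.per.connection to be set to 1 and retries cannot be zero. Additionally, acks must be set to 'all'. If these values are left at their defaults, we will override the defaults to be suitable. If the values are set to something incompatible with the idempotent producer, a ConfigException will be thrown.	boolean	false		low
interceptor.classes	A list of classes to use as interceptors. Implementing the ProducerInterceptor interface allows you to intercept (and possibly mutate) the records received by the producer before they are published to the Kafka cluster. By default, there are no interceptors.	list	null		low
max.in.flight.requests.per.connection	The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is greater than 1 and there are failed sends, there is a risk of message reordering due to retries (if retries are enabled).	int	5	[1,...]	low
metadata.max.age.ms	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions.	long	300000	[0,...]	low
metric.reporters	A list of classes to use as metrics reporters. Implementing the MetricReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.	list	""		low
metrics.num.samples	The number of samples maintained to compute metrics	int	2	[1,...]	low
metrics.recording.level	The highest recording level for metrics	string	INFO	[INFO, DEBUG]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over	long	30000	[0,...]	low
reconnect.backoff.max.ms	The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host increases exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms.	long	1000	[0,...]	low
reconnect.backoff.ms	The base amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all connection attempts by the client to a broker.	long	50	[0,...]	low
retry.backoff.ms	The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.	long	100	[0,...]	low
sasl.kerberos.kinit.cmd	Kerberos kinit command path	string	/usr/bin/kinit		low
sasl.kerberos.min.time.before.relogin	Login thread sleep time between refresh attempts	long	60000		low
sasl.kerberos.ticket.renew.jitter	Percentage of random jitter added to the renewal time	double	0.05		low
sasl.kerberos.ticket.renew.window.factor	The login thread sleeps until the specified window factor of time from the last refresh to the ticket's expiry has been reached, at which time it tries to renew the ticket	double	0.8		low
ssl.cipher.suites	A list of cipher suites. A cipher suite is a named combination of authentication, encryption, MAC and key exchange algorithms used to negotiate the security settings for a network connection using the TLS or SSL network protocol. By default, all available cipher suites are supported.	list	null		low
ssl.endpoint.identification.algorithm	The endpoint identification algorithm used to validate the server hostname using the server certificate	string	null		low
ssl.keymanager.algorithm	The algorithm used by the key manager factory for SSL connections. The default value is the key manager factory algorithm configured for the Java Virtual Machine.	string	SunX509		low
ssl.secure.random.implementation	The SecureRandom PRNG implementation to use for SSL cryptography operations	string	null		low
ssl.trustmanager.algorithm	The algorithm used by the trust manager factory for SSL connections. The default value is the trust manager factory algorithm configured for the Java Virtual Machine.	string	PKIX		low
transaction.timeout.ms	The maximum amount of time in ms that the transaction coordinator will wait for a transaction status update from the producer before proactively aborting the ongoing transaction. If this value is larger than the max.transaction.timeout.ms setting in the broker, the request will fail with an `InvalidTransactionTimeout` error.	int	60000		low
transactional.id	The TransactionalId to use for transactional delivery. This enables reliability semantics that span multiple producer sessions, since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, the producer is limited to idempotent delivery. Note that enable.idempotence must be enabled if a TransactionalId is configured. The default is empty, which means transactions cannot be used.	string	null	non-empty string	low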
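The `batch.size` and `linger.ms` rows describe two independent triggers for sending a partition's batch: it is full, or it has waited long enough. The plain-Java sketch below models only those two triggers; the class `BatchReadySketch` is hypothetical, and the real producer has further reasons to send early (for example an explicit `flush()` or closing the producer).

```java
// Hypothetical, simplified model of when a per-partition batch becomes sendable,
// based only on the batch.size and linger.ms descriptions above.
final class BatchReadySketch {

    // batchBytes: bytes already in the partition's open batch
    // waitedMs:   time since the first record was added to that batch
    static boolean isReady(int batchBytes, int batchSize, long waitedMs, long lingerMs) {
        boolean full = batchBytes >= batchSize;    // batch.size reached: send regardless of linger.ms
        boolean lingered = waitedMs >= lingerMs;   // linger.ms elapsed: send what we have
        return full || lingered;
    }

    public static void main(String[] args) {
        int batchSize = 16384;                                  // default batch.size
        System.out.println(isReady(100, batchSize, 0, 0));      // true: linger.ms=0 sends at once
        System.out.println(isReady(100, batchSize, 2, 5));      // false: keep lingering
        System.out.println(isReady(100, batchSize, 5, 5));      // true: linger.ms=5 elapsed
        System.out.println(isReady(16384, batchSize, 0, 5));    // true: batch is full
    }
}
```

With the default `linger.ms=0` the wait condition is satisfied immediately, so batching only happens when records pile up faster than the sender thread drains them, matching the "Normally this occurs only under load" remark in the table.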

For those interested in the legacy Scala producer configs, information can be found here.
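The reordering risk mentioned in the `retries` and `max.in.flight.requests.per.connection` rows above can be reproduced with a toy simulation: batches for one partition are sent in order, and one of them fails once with a transient error. The sketch assumes a non-idempotent producer and a simplified, window-by-window view of in-flight requests; the class `RetryOrderingSketch` is hypothetical and does not reflect the producer's actual network code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy simulation of the reordering risk described for retries and
// max.in.flight.requests.per.connection (no idempotence, single partition).
final class RetryOrderingSketch {

    // Sends batches in order, at most maxInFlight at a time without waiting for responses.
    // The batch named failsOnce gets a transient error on its first attempt and is retried,
    // so it is appended only after the other batches of its in-flight window.
    static List<String> appendOrder(List<String> batches, String failsOnce, int maxInFlight) {
        List<String> log = new ArrayList<>();
        for (int start = 0; start < batches.size(); start += maxInFlight) {
            List<String> window = batches.subList(start, Math.min(start + maxInFlight, batches.size()));
            List<String> retried = new ArrayList<>();
            for (String batch : window) {
                if (batch.equals(failsOnce)) {
                    retried.add(batch);     // first attempt fails, retry queued
                } else {
                    log.add(batch);         // appended by the partition leader
                }
            }
            log.addAll(retried);            // the retry succeeds after the others
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> batches = Arrays.asList("B1", "B2");
        System.out.println(appendOrder(batches, "B1", 1)); // [B1, B2]: order preserved
        System.out.println(appendOrder(batches, "B1", 5)); // [B2, B1]: B2 overtakes B1
    }
}
```

With the default of 5 in-flight requests, the retried batch `B1` lands after `B2`; limiting in-flight requests to 1 trades throughput for ordering, which is why the `retries` row recommends it when order matters.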

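Kafka Consumer Configs

3.4 Kafka Consumer Configs

In 0.9.0.0 we introduced the new Java consumer as a replacement for the older Scala-based simple and high-level consumers. The configs for both the new and the old consumers are described below.

3.4.1 New Consumer Configs

Below is the configuration for the new consumer: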

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
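bootstrap.servers	A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. Since these servers are only used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).	list			high
key.deserializer	Deserializer class for key that implements the Deserializer interface.	class			high
value.deserializer	Deserializer class for value that implements the Deserializer interface.	class			high
fetch.min.bytes	The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request. The default setting of 1 byte means that fetch requests are answered as soon as a single byte of data is available or the fetch request times out waiting for data to arrive. Setting this to something greater than 1 will cause the server to wait for larger amounts of data to accumulate, which can improve server throughput at the cost of some additional latency.	int	1	[0,...]	high
group.id	A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality (by using subscribe(topic)) or the Kafka-based offset management strategy.	string	""		high
heartbeat.interval.ms	The expected time between heartbeats to the consumer coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the consumer's session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances.	int	3000		high
max.partition.fetch.bytes	The maximum amount of data per partition the server will return. If the first message in the first non-empty partition of the fetch is larger than this limit, the message will still be returned to ensure that the consumer can make progress. The maximum message size accepted by the broker is defined via `message.max.bytes` (broker config) or `max.message.bytes` (topic config). See fetch.max.bytes for limiting the consumer request size.	int	1048576	[0,...]	high
session.timeout.ms	The timeout used to detect consumer failures. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by `group.min.session.timeout.ms` and `group.max.session.timeout.ms`.	int	10000		high
ssl.key.password	The password of the private key in the key store file. This is optional for client.	password	null		high
ssl.keystore.location	The location of the key store file. This is optional for client and can be used for two-way authentication for client.	string	null		high
ssl.keystore.password	The store password for the key store file. This is optional for client and only needed if ssl.keystore.location is configured.	password	null		high
ssl.truststore.location	The location of the trust store file.	string	null		high
ssl.truststore.password	The password for the trust store file.	password	null		high
auto.offset.reset	What to do when there is no initial offset in Kafka or if the current offset does not exist any more (e.g. because that data has been deleted): earliest: automatically reset the offset to the earliest offset; latest: automatically reset the offset to the latest offset; none: throw an exception to the consumer if no previous offset is found for the consumer's group; anything else: throw an exception to the consumer.	string	latest	[latest, earliest, none]	medium
connections.max.idle.ms	Close idle connections after the number of milliseconds specified by this config.	long	540000		medium
enable.auto.commit	If true, the consumer's offset will be periodically committed in the background.	boolean	true		medium
exclude.internal.topics	Whether records from internal topics (such as offsets) should be exposed to the consumer. If set to true, the only way to receive records from an internal topic is subscribing to it.	boolean	true		medium
fetch.max.bytes	The maximum amount of data the server should return for a fetch request. This is not an absolute maximum: if the first message in the first non-empty partition of the fetch is larger than this value, the message will still be returned to ensure that the consumer can make progress. The maximum message size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). Note that the consumer performs multiple fetches in parallel.	int	52428800	[0,...]	medium
max.poll.interval.ms	The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.	int	300000	[1,...]	medium
max.poll.records	The maximum number of records returned in a single call to `poll()`.	int	500	[1,...]	medium
partition.assignment.strategy	The class name of the partition assignment strategy that the client will use to distribute partition ownership amongst consumer instances when group management is used.	list	class org.apache.kafka.clients.consumer.RangeAssignor		medium
receive.buffer.bytes	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.	int	65536	[-1,...]	medium
request.timeout.ms	The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary or fail the request if retries are exhausted.	int	305000	[0,...]	medium
sasl.jaas.config	JAAS login context parameters for SASL connections in the format used by JAAS configuration files. JAAS configuration file format is described here. The format for the value is: '<loginModuleClass> <controlFlag> (<optionName>=<optionValue>)*;'	password	null		medium
sasl.kerberos.service.name	The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.	string	null		medium
sasl.mechanism	SASL mechanism used for client connections. This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.	string	GSSAPI		medium
security.protocol	Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
send.buffer.bytes	The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used.	int	131072	[-1,...]	medium
ssl.enabled.protocols	The list of protocols enabled for SSL connections.	list	TLSv1.2,TLSv1.1,TLSv1		medium
ssl.keystore.type	The file format of the key store file. This is optional for client.	string	JKS		medium
ssl.protocol	The SSL protocol used to generate the SSLContext. Default setting is TLS, which is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their usage is discouraged due to known security vulnerabilities.	string	TLS		medium
ssl.provider	The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.	string	null		medium
ssl.truststore.type	The file format of the trust store file.	string	JKS		medium
auto.commit.interval.ms	The frequency in milliseconds that the consumer offsets are auto-committed to Kafka if enable.auto.commit is set to true.	int	5000	[0,...]	low
check.crcs	Automatically check the CRC32 of the records consumed. This ensures no on-the-wire or on-disk corruption of the messages occurred. This check adds some overhead, so it may be disabled in cases seeking extreme performance.	boolean	true		low
client.id	An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.	string	""		low
fetch.max.wait.ms	The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy the requirement given by fetch.min.bytes.	int	500	[0,...]	low
interceptor.classes	A list of classes to use as interceptors. Implementing the ConsumerInterceptor interface allows you to intercept (and possibly mutate) records received by the consumer. By default, there are no interceptors.	list	null		low
metadata.max.age.ms	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions.	long	300000	[0,...]	low
metric.reporters	A list of classes to use as metrics reporters. Implementing the MetricReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.	list	""		low
metrics.num.samples	The number of samples maintained to compute metrics.	int	2	[1,...]	low
metrics.recording.level	The highest recording level for metrics.	string	INFO	[INFO, DEBUG]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over.	long	30000	[0,...]	low
reconnect.backoff.ms	The amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all requests sent by the consumer to the broker.	long	50	[0,...]	low
retry.backoff.ms	The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.	long	100	[0,...]	low
sasl.kerberos.kinit.cmd	Kerberos kinit command path.	string	/usr/bin/kinit		low
sasl.kerberos.min.time.before.relogin	Login thread sleep time between refresh attempts.	long	60000		low
sasl.kerberos.ticket.renew.jitter	Percentage of random jitter added to the renewal time.	double	0.05		low
sasl.kerberos.ticket.renew.window.factor	Login thread will sleep until the specified window factor of time from last refresh to ticket's expiry has been reached, at which time it will try to renew the ticket.	double	0.8		low
ssl.cipher.suites	A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported.	list	null		low
ssl.endpoint.identification.algorithm	The endpoint identification algorithm to validate server hostname using server certificate.	string	null		low
ssl.keymanager.algorithm	The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine.	string	SunX509		low
ssl.secure.random.implementation	The SecureRandom PRNG implementation to use for SSL cryptography operations.	string	null		low
ssl.trustmanager.algorithm	The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine.	string	PKIX		low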
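Several rows above state relationships between settings: heartbeat.interval.ms must be lower than session.timeout.ms and typically no higher than a third of it, auto.offset.reset accepts only latest, earliest or none, and bootstrap.servers has no default. A minimal sketch of a checker for these rules, using java.util.Properties only; the class name, the group id demo-group and the bootstrap address are hypothetical placeholders, and the checks cover only the rules quoted from the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical checker for a few rules stated in the new consumer config table.
class ConsumerConfigCheckSketch {

    static Properties defaults(String bootstrap, String groupId) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrap);
        p.setProperty("group.id", groupId);
        p.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("heartbeat.interval.ms", "3000");
        p.setProperty("session.timeout.ms", "10000");
        p.setProperty("auto.offset.reset", "latest");
        return p;
    }

    static List<String> check(Properties p) {
        List<String> problems = new ArrayList<>();
        int heartbeat = Integer.parseInt(p.getProperty("heartbeat.interval.ms", "3000"));
        int session = Integer.parseInt(p.getProperty("session.timeout.ms", "10000"));
        if (heartbeat >= session) {
            problems.add("heartbeat.interval.ms must be lower than session.timeout.ms");
        } else if (heartbeat * 3 > session) {
            problems.add("heartbeat.interval.ms is usually no higher than 1/3 of session.timeout.ms");
        }
        String reset = p.getProperty("auto.offset.reset", "latest");
        if (!reset.equals("latest") && !reset.equals("earliest") && !reset.equals("none")) {
            problems.add("auto.offset.reset must be latest, earliest or none");
        }
        if (p.getProperty("bootstrap.servers", "").isEmpty()) {
            problems.add("bootstrap.servers has no default and must be set");
        }
        return problems;
    }
}
```

With the table defaults (3000 ms heartbeat, 10000 ms session) the check passes; raising the heartbeat to 4000 ms trips the one-third guideline.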

3.4.2 Old Consumer Configs

The essential old consumer configurations are the following:

group.id
zookeeper.connect

PROPERTY	DEFAULT	DESCRIPTION
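group.id		A string that uniquely identifies the group of consumer processes to which this consumer belongs. By setting the same group id, multiple processes indicate that they are all part of the same consumer group.
zookeeper.connect		Specifies the ZooKeeper connection string in the form hostname:port, where host and port are the host and port of a ZooKeeper server. To allow connecting through other ZooKeeper nodes when that ZooKeeper machine is down, you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. The server may also have a ZooKeeper chroot path as part of its ZooKeeper connection string, which puts its data under some path in the global ZooKeeper namespace. If so, the consumer should use the same chroot path in its connection string. For example, to give a chroot path of /chroot/path you would give the connection string as `hostname1:port1,hostname2:port2,hostname3:port3/chroot/path`.
consumer.id	null	Generated automatically if not set.
socket.timeout.ms	30 * 1000	The socket timeout for network requests. The actual timeout set will be max.fetch.wait + socket.timeout.ms.
socket.receive.buffer.bytes	64 * 1024	The socket receive buffer size for network requests.
fetch.message.max.bytes	1024 * 1024	The number of bytes of messages to attempt to fetch for each topic-partition in each fetch request. These bytes will be read into memory for each partition, so this helps control the memory used by the consumer. The fetch request size must be at least as large as the maximum message size the server allows, or else it is possible for the producer to send messages larger than the consumer can fetch.
num.consumer.fetchers	1	The number of fetcher threads used to fetch data.
auto.commit.enable	true	If true, periodically commit to ZooKeeper the offset of messages already fetched by the consumer. This committed offset will be used when the process fails as the position from which the new consumer will begin.
auto.commit.interval.ms	60 * 1000	The frequency in ms that the consumer offsets are committed to ZooKeeper.
queued.max.message.chunks	2	Max number of message chunks buffered for consumption. Each chunk can be up to fetch.message.max.bytes.
rebalance.max.retries	4	When a new consumer joins a consumer group, the set of consumers attempt to "rebalance" the load to assign partitions to each consumer. If the set of consumers changes while this assignment is taking place, the rebalance will fail and retry. This setting controls the maximum number of attempts before giving up.
fetch.min.bytes	1	The minimum amount of data the server should return for a fetch request. If insufficient data is available, the request will wait for that much data to accumulate before answering the request.
fetch.wait.max.ms	100	The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy fetch.min.bytes.
rebalance.backoff.ms	2000	Backoff time between retries during rebalance. If not set explicitly, the value in zookeeper.sync.time.ms is used.
refresh.leader.backoff.ms	200	Backoff time to wait before trying to determine the leader of a partition that has just lost its leader.
auto.offset.reset	largest	What to do when there is no initial offset in ZooKeeper or if an offset is out of range: smallest: automatically reset the offset to the smallest offset; largest: automatically reset the offset to the largest offset; anything else: throw an exception to the consumer.
consumer.timeout.ms	-1	Throw a timeout exception to the consumer if no message is available for consumption after the specified interval.
exclude.internal.topics	true	Whether messages from internal topics (such as offsets) should be exposed to the consumer.
client.id	group id value	The client id is a user-specified string sent in each request to help trace calls. It should logically identify the application making the request.
zookeeper.session.timeout.ms	6000	ZooKeeper session timeout. If the consumer fails to heartbeat to ZooKeeper for this period of time, it is considered dead and a rebalance will occur.
zookeeper.connection.timeout.ms	6000	The max time that the client waits while establishing a connection to ZooKeeper.
zookeeper.sync.time.ms	2000	How far a ZK follower can be behind a ZK leader.
offsets.storage	zookeeper	Select where offsets should be stored (zookeeper or kafka).
offsets.channel.backoff.ms	1000	The backoff period when reconnecting the offsets channel or retrying failed offset fetch/commit requests.
offsets.channel.socket.timeout.ms	10000	Socket timeout when reading responses for offset fetch/commit requests. This timeout is also used for ConsumerMetadata requests that are used to query for the offset manager.
offsets.commit.max.retries	5	Retry the offset commit up to this many times on failure. This retry count only applies to offset commits during shut-down. It does not apply to commits originating from the auto-commit thread. It also does not apply to attempts to query for the offset coordinator before committing offsets; i.e., if a consumer metadata request fails for any reason, it will be retried, and that retry does not count toward this limit.
dual.commit.enabled	true	If you are using "kafka" as offsets.storage, you can dual commit offsets to ZooKeeper (in addition to Kafka). This is required during migration from ZooKeeper-based offset storage to Kafka-based offset storage. With respect to any given consumer group, it is safe to turn this off after all instances within that group have been migrated to the new version that commits offsets to the broker (instead of directly to ZooKeeper).
partition.assignment.strategy	range	Select between the "range" or "roundrobin" strategy for assigning partitions to consumer streams. The round-robin partition assignor lays out all the available partitions and all the available consumer threads. It then proceeds to do a round-robin assignment from partition to consumer thread. If the subscriptions of all consumer instances are identical, then the partitions will be uniformly distributed (i.e., the partition ownership counts will be within a delta of exactly one across all consumer threads). Round-robin assignment is permitted only if: (a) every topic has the same number of streams within a consumer instance, and (b) the set of subscribed topics is identical for every consumer instance within the group. Range partitioning works on a per-topic basis. For each topic, we lay out the available partitions in numeric order and the consumer threads in lexicographic order. We then divide the number of partitions by the total number of consumer streams (threads) to determine the number of partitions to assign to each consumer. If it does not evenly divide, then the first few consumers will have one extra partition.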

For more details about consumer configs, see the Scala class kafka.consumer.ConsumerConfig.
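The two strategies described in the partition.assignment.strategy row can be sketched as plain functions over a topic-to-partition-count map and a list of consumer thread ids. This is a simplified illustration, not the old consumer's assignor code; the class name is hypothetical, and ordering the round-robin layout by topic name and then partition number is an assumption (the real assignor may order topic-partitions differently).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of the "range" and "roundrobin" strategies described above.
// Input: topic name -> partition count, plus consumer thread ids.
class AssignmentSketch {

    // Range: per topic, partitions in numeric order are split into contiguous chunks;
    // the first (partitions % threads) threads each receive one extra partition.
    static Map<String, List<String>> range(Map<String, Integer> topics, List<String> threads) {
        List<String> sorted = sortedCopy(threads);
        Map<String, List<String>> result = emptyAssignment(sorted);
        for (Map.Entry<String, Integer> t : new TreeMap<>(topics).entrySet()) {
            int perThread = t.getValue() / sorted.size();
            int extra = t.getValue() % sorted.size();
            int next = 0;
            for (int i = 0; i < sorted.size(); i++) {
                int count = perThread + (i < extra ? 1 : 0);
                for (int k = 0; k < count; k++) {
                    result.get(sorted.get(i)).add(t.getKey() + "-" + next++);
                }
            }
        }
        return result;
    }

    // Round-robin: lay out every topic-partition, then deal them out one at a time.
    static Map<String, List<String>> roundRobin(Map<String, Integer> topics, List<String> threads) {
        List<String> sorted = sortedCopy(threads);
        Map<String, List<String>> result = emptyAssignment(sorted);
        int i = 0;
        for (Map.Entry<String, Integer> t : new TreeMap<>(topics).entrySet()) {
            for (int p = 0; p < t.getValue(); p++) {
                result.get(sorted.get(i++ % sorted.size())).add(t.getKey() + "-" + p);
            }
        }
        return result;
    }

    private static List<String> sortedCopy(List<String> threads) {
        List<String> sorted = new ArrayList<>(threads);
        Collections.sort(sorted); // lexicographic order of thread ids
        return sorted;
    }

    private static Map<String, List<String>> emptyAssignment(List<String> threads) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String c : threads) result.put(c, new ArrayList<>());
        return result;
    }
}
```

With two topics of three partitions each and threads c0 and c1, range gives c0 four partitions (a-0, a-1, b-0, b-1) and c1 only two, because the extra partition of every topic goes to the first thread; round-robin gives each thread three.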

Kafka Streams Configs

3.6 Kafka Streams Configs

Below is the configuration of the Kafka Streams client library.

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
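application.id	An identifier for the stream processing application. Must be unique within the Kafka cluster. It is used as 1) the default client-id prefix, 2) the group-id for membership management, 3) the changelog topic prefix.	string			high
bootstrap.servers	A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping; this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are only used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).	list			high
replication.factor	The replication factor for changelog topics and repartition topics created by the stream processing application.	int	1		high
state.dir	Directory location for state stores.	string	/tmp/kafka-streams		high
cache.max.bytes.buffering	Maximum number of memory bytes to be used for buffering across all threads.	long	10485760	[0,...]	low
client.id	An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.	string	""		high
default.key.serde	Default serializer / deserializer class for key that implements the Serde interface.	class	org.apache.kafka.common.serialization.Serdes$ByteArraySerde		medium
default.timestamp.extractor	Default timestamp extractor class that implements the TimestampExtractor interface.	class	org.apache.kafka.streams.processor.FailOnInvalidTimestamp		medium
default.value.serde	Default serializer / deserializer class for value that implements the Serde interface.	class	org.apache.kafka.common.serialization.Serdes$ByteArraySerde		medium
num.standby.replicas	The number of standby replicas for each task.	int	0		low
num.stream.threads	The number of threads to execute stream processing.	int	1		low
processing.guarantee	The processing guarantee that should be used. Possible values are at_least_once (default) and exactly_once.	string	at_least_once	[at_least_once, exactly_once]	medium
security.protocol	Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
application.server	A host:port pair pointing to an embedded user-defined endpoint that can be used for discovering the locations of state stores within a single KafkaStreams application.	string	""		low
buffered.records.per.partition	The maximum number of records to buffer per partition.	int	1000		low
commit.interval.ms	The frequency with which to save the position of the processor. Note that if 'processing.guarantee' is set to 'exactly_once', the default value is 100; otherwise the default value is 30000.	long	30000		low
connections.max.idle.ms	Close idle connections after the number of milliseconds specified by this config.	long	540000		medium
key.serde	Serializer / deserializer class for key that implements the Serde interface. This config is deprecated; use default.key.serde instead.	class	null		low
metadata.max.age.ms	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions.	long	300000	[0,...]	low
metric.reporters	A list of classes to use as metrics reporters, implementing the MetricReporter interface. The JmxReporter is always included to register JMX statistics.	list	""		low
metrics.num.samples	The number of samples maintained to compute metrics.	int	2	[1,...]	low
metrics.recording.level	The highest recording level for metrics.	string	INFO	[INFO, DEBUG]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over.	long	30000	[0,...]	low
partition.grouper	Partition grouper class that implements the PartitionGrouper interface.	class	org.apache.kafka.streams.processor.DefaultPartitionGrouper		medium
poll.ms	The amount of time in milliseconds to block waiting for input.	long	100		low
receive.buffer.bytes	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.	int	32768	[0,...]	medium
reconnect.backoff.max.ms	The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms.	long	1000	[0,...]	low
reconnect.backoff.ms	The amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all requests sent by the consumer to the broker.	long	50	[0,...]	low
request.timeout.ms	The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary or fail the request if retries are exhausted.	int	40000	[0,...]	low
retry.backoff.ms	The amount of time to wait before attempting to retry a failed request. This avoids repeatedly sending requests in a tight loop under some failure scenarios.	long	100	[0,...]	low
rocksdb.config.setter	A RocksDB config setter class, or the name of a class that implements the RocksDBConfigSetter interface.	class	null		low
send.buffer.bytes	The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used.	int	131072	[0,...]	low
state.cleanup.delay.ms	The amount of time in milliseconds to wait before deleting state when a partition has migrated.	long	60000		low
timestamp.extractor	Timestamp extractor class that implements the TimestampExtractor interface. This config is deprecated; use default.timestamp.extractor instead.	class	null		low
windowstore.changelog.additional.retention.ms	Added to a window's maintainMs to ensure data is not deleted from the log prematurely. Default is 1 day.	long	86400000		low
zookeeper.connect	ZooKeeper connect string for Kafka topic management. This config is deprecated and will be ignored, as the Streams API no longer uses ZooKeeper.	string	""		low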
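The commit.interval.ms row states that its default depends on processing.guarantee. A minimal sketch of that rule, together with assembling a Streams configuration from the keys in the table, using java.util.Properties only; the class name and the application id wordcount-app are hypothetical, and setting replication.factor to 3 is an illustrative choice for a multi-broker cluster (the table default is 1).

```java
import java.util.Properties;

// Hypothetical helper reflecting defaults described in the Streams config table.
class StreamsConfigSketch {

    // commit.interval.ms defaults to 100 under exactly_once, otherwise 30000.
    static long defaultCommitIntervalMs(String processingGuarantee) {
        return "exactly_once".equals(processingGuarantee) ? 100L : 30000L;
    }

    static Properties build(String appId, String bootstrap, String guarantee) {
        if (!guarantee.equals("at_least_once") && !guarantee.equals("exactly_once")) {
            throw new IllegalArgumentException("unsupported processing.guarantee: " + guarantee);
        }
        Properties p = new Properties();
        // application.id must be unique in the cluster; it is also used as the default
        // client-id prefix, the group id, and the changelog topic prefix.
        p.setProperty("application.id", appId);
        p.setProperty("bootstrap.servers", bootstrap);
        p.setProperty("processing.guarantee", guarantee);
        p.setProperty("commit.interval.ms", Long.toString(defaultCommitIntervalMs(guarantee)));
        // Illustrative choice; the table default for replication.factor is 1.
        p.setProperty("replication.factor", "3");
        return p;
    }
}
```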

Kafka Connect Configs

3.5 Kafka Connect Configs

Below is the configuration of the Kafka Connect framework.

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
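config.storage.topic	The name of the Kafka topic where connector configurations are stored.	string			high
group.id	A unique string that identifies the Connect cluster group this worker belongs to.	string			high
key.converter	Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the keys in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.	class			high
offset.storage.topic	The name of the Kafka topic where connector offsets are stored.	string			high
status.storage.topic	The name of the Kafka topic where connector and task status are stored.	string			high
value.converter	Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the values in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.	class			high
internal.key.converter	Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the keys in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro. This setting controls the format used for internal bookkeeping data used by the framework, such as configs and offsets, so users can typically use any functioning Converter implementation.	class			low
internal.value.converter	Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the values in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro. This setting controls the format used for internal bookkeeping data used by the framework, such as configs and offsets, so users can typically use any functioning Converter implementation.	class			low
bootstrap.servers	A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. This list is only used for the initial hosts used to discover the full set of servers. It should be in the form host1:port1,host2:port2,.... Since these servers are only used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).	list	localhost:9092		high
heartbeat.interval.ms	The expected time between heartbeats to the group coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the worker's session stays active and to facilitate rebalancing when new members join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value.	int	3000		high
rebalance.timeout.ms	A limit on the amount of time needed for all tasks to flush any pending data and commit offsets once a rebalance has begun. If the timeout is exceeded, the worker will be removed from the group, which will cause offset commit failures.	int	60000		high
session.timeout.ms	The timeout used to detect worker failures. The worker sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove the worker from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by `group.min.session.timeout.ms` and `group.max.session.timeout.ms`.	int	10000		high
ssl.key.password	The password of the private key in the key store file. This is optional for client.	password	null		high
ssl.keystore.location	The location of the key store file. This is optional for client and can be used for two-way authentication for client.	string	null		high
ssl.keystore.password	The store password for the key store file. This is optional for client and only needed if ssl.keystore.location is configured.	password	null		high
ssl.truststore.location	The location of the trust store file.	string	null		high
ssl.truststore.password	The password for the trust store file.	password	null		high
connections.max.idle.ms	Close idle connections after the number of milliseconds specified by this config.	long	540000		medium
receive.buffer.bytes	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.	int	32768	[0,...]	medium
request.timeout.ms	The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary or fail the request if retries are exhausted.	int	40000	[0,...]	medium
sasl.jaas.config	JAAS login context parameters for SASL connections in the format used by JAAS configuration files. JAAS configuration file format is described here. The format for the value is: '<loginModuleClass> <controlFlag> (<optionName>=<optionValue>)*;'	password	null		medium
sasl.kerberos.service.name	The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.	string	null		medium
sasl.mechanism	SASL mechanism used for client connections. This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.	string	GSSAPI		medium
security.protocol	Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
send.buffer.bytes	The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used.	int	131072	[-1,...]	medium
ssl.enabled.protocols	The list of protocols enabled for SSL connections.	list	TLSv1.2,TLSv1.1,TLSv1		medium
ssl.keystore.type	The file format of the key store file. This is optional for client.	string	JKS		medium
ssl.protocol	The SSL protocol used to generate the SSLContext. Default setting is TLS, which is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their usage is discouraged due to known security vulnerabilities.	string	TLS		medium
ssl.provider	The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.	string	null		medium
ssl.truststore.type	The file format of the trust store file.	string	JKS		medium
worker.sync.timeout.ms	When the worker is out of sync with other workers and needs to resynchronize configurations, wait up to this amount of time before giving up, leaving the group, and waiting a backoff period before rejoining.	int	3000		medium
worker.unsync.backoff.ms	When the worker is out of sync with other workers and fails to catch up within worker.sync.timeout.ms, leave the Connect cluster for this long before rejoining.	int	300000		medium
access.control.allow.methods	Sets the methods supported for cross-origin requests by setting the Access-Control-Allow-Methods header. The default value of the Access-Control-Allow-Methods header allows cross-origin requests for GET, POST and HEAD.	string	""		low
access.control.allow.origin	Value to set the Access-Control-Allow-Origin header to for REST API requests. To enable cross-origin access, set this to the domain of the application that should be permitted to access the API, or '*' to allow access from any domain. The default value only allows access from the domain of the REST API.	string	""		low
client.id	An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.	string	""		low
config.storage.replication.factor	Replication factor used when creating the configuration storage topic.	short	3	[1,...]	low
metadata.max.age.ms	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions.	long	300000	[0,...]	low
metric.reporters	A list of classes to use as metrics reporters. Implementing the MetricReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.	list	""		low
metrics.num.samples	The number of samples maintained to compute metrics.	int	2	[1,...]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over.	long	30000	[0,...]	low
offset.flush.interval.ms	Interval at which to try committing offsets for tasks.	long	60000		low
offset.flush.timeout.ms	Maximum number of milliseconds to wait for records to flush and partition offset data to be committed to offset storage before cancelling the process and restoring the offset data to be committed in a future attempt.	long	5000		low
offset.storage.partitions	The number of partitions used when creating the offset storage topic.	int	25	[1,...]	low
offset.storage.replication.factor	Replication factor used when creating the offset storage topic.	short	3	[1,...]	low
plugin.path	List of paths separated by commas (,) that contain plugins (connectors, converters, transformations). The list should consist of top-level directories that include any combination of: a) directories immediately containing jars with plugins and their dependencies, b) uber-jars with plugins and their dependencies, c) directories immediately containing the package directory structure of classes of plugins and their dependencies. Note: symlinks will be followed to discover dependencies or plugins. Example: plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors	list	null		low
reconnect.backoff.max.ms	The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms.	long	1000	[0,...]	low
reconnect.backoff.ms	The amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all requests sent by the consumer to the broker.	long	50	[0,...]	low
rest.advertised.host.name	If this is set, this is the hostname that will be given out to other workers to connect to.	string	null		low
rest.advertised.port	If this is set, this is the port that will be given out to other workers to connect to.	int	null		low
rest.host.name	Hostname for the REST API. If this is set, it will only bind to this interface.	string	null		low
rest.port	Port for the REST API to listen on.	int	8083		low
retry.backoff.ms	The amount of time to wait before attempting to retry a failed request. This avoids repeatedly sending requests in a tight loop under some failure scenarios.	long	100	[0,...]	low
sasl.kerberos.kinit.cmd	Kerberos kinit command path.	string	/usr/bin/kinit		low
sasl.kerberos.min.time.before.relogin	Login thread sleep time between refresh attempts.	long	60000		low
sasl.kerberos.ticket.renew.jitter	Percentage of random jitter added to the renewal time.	double	0.05		low
sasl.kerberos.ticket.renew.window.factor	Login thread will sleep until the specified window factor of time from last refresh to ticket's expiry has been reached, at which time it will try to renew the ticket.	double	0.8		low
ssl.cipher.suites	A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported.	list	null		low
ssl.endpoint.identification.algorithm	The endpoint identification algorithm to validate server hostname using server certificate.	string	null		low
ssl.keymanager.algorithm	The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine.	string	SunX509		low
ssl.secure.random.implementation	The SecureRandom PRNG implementation to use for SSL cryptography operations.	string	null		low
ssl.trustmanager.algorithm	The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine.	string	PKIX		low
status.storage.partitions	The number of partitions used when creating the status storage topic.	int	5	[1,...]	low
status.storage.replication.factor	Replication factor used when creating the status storage topic.	short	3	[1,...]	low
task.shutdown.graceful.timeout.ms	Amount of time to wait for tasks to shut down gracefully. This is the total amount of time, not per task. All tasks have shutdown triggered, then they are waited on sequentially.	long	5000		low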
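The task.shutdown.graceful.timeout.ms row describes a shared budget: shutdown is triggered for all tasks, then each one is awaited in turn with whatever time remains. A minimal sketch of that waiting pattern with java.util.concurrent; modelling tasks as Futures and the class name are assumptions for illustration, not the Connect worker's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GracefulShutdownSketch {

    // Waits on each task in turn, sharing one overall time budget (not one per task).
    // Returns the indexes of tasks that had not finished when the budget ran out.
    static List<Integer> awaitAll(List<? extends Future<?>> tasks, long totalTimeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalTimeoutMs);
        List<Integer> unfinished = new ArrayList<>();
        for (int i = 0; i < tasks.size(); i++) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                tasks.get(i).get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException e) {
                unfinished.add(i);
            } catch (ExecutionException e) {
                // The task failed while stopping, but it has stopped.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                for (int j = i; j < tasks.size(); j++) unfinished.add(j);
                break;
            }
        }
        return unfinished;
    }
}
```

Because the budget is shared, a slow task early in the list leaves less time for the tasks after it; tasks that are already done are still detected even when no time remains.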

Kafka AdminClient Configs

3.7 AdminClient Configs

Below is the configuration of the Kafka AdminClient library.

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE
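bootstrap.servers	A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. Since these servers are only used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).	list			high
ssl.key.password	The password of the private key in the key store file. This is optional for client.	password	null		high
ssl.keystore.location	The location of the key store file. This is optional for client and can be used for two-way authentication for client.	string	null		high
ssl.keystore.password	The store password for the key store file. This is optional for client and only needed if ssl.keystore.location is configured.	password	null		high
ssl.truststore.location	The location of the trust store file.	string	null		high
ssl.truststore.password	The password for the trust store file. If a password is not set, access to the truststore is still available, but integrity checking is disabled.	password	null		high
client.id	An id string to pass to the server when making requests. The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.	string	""		medium
connections.max.idle.ms	Close idle connections after the number of milliseconds specified by this config.	long	300000		medium
receive.buffer.bytes	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.	int	65536	[-1,...]	medium
request.timeout.ms	The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses, the client will resend the request if necessary or fail the request if retries are exhausted.	int	120000	[0,...]	medium
sasl.jaas.config	JAAS login context parameters for SASL connections in the format used by JAAS configuration files. JAAS configuration file format is described here. The format for the value is: '<loginModuleClass> <controlFlag> (<optionName>=<optionValue>)*;'	password	null		medium
sasl.kerberos.service.name	The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config.	string	null		medium
sasl.mechanism	SASL mechanism used for client connections. This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.	string	GSSAPI		medium
security.protocol	Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL.	string	PLAINTEXT		medium
send.buffer.bytes	The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used.	int	131072	[-1,...]	medium
ssl.enabled.protocols	The list of protocols enabled for SSL connections.	list	TLSv1.2,TLSv1.1,TLSv1		medium
ssl.keystore.type	The file format of the key store file. This is optional for client.	string	JKS		medium
ssl.protocol	The SSL protocol used to generate the SSLContext. Default setting is TLS, which is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their usage is discouraged due to known security vulnerabilities.	string	TLS		medium
ssl.provider	The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.	string	null		medium
ssl.truststore.type	The file format of the trust store file.	string	JKS		medium
metadata.max.age.ms	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions.	long	300000	[0,...]	low
metric.reporters	A list of classes to use as metrics reporters. Implementing the MetricReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics.	list	""		low
metrics.num.samples	The number of samples maintained to compute metrics.	int	2	[1,...]	low
metrics.recording.level	The highest recording level for metrics.	string	INFO	[INFO, DEBUG]	low
metrics.sample.window.ms	The window of time a metrics sample is computed over.	long	30000	[0,...]	low
reconnect.backoff.max.ms	The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms.	long	1000	[0,...]	low
reconnect.backoff.ms	The base amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all connection attempts by the client to a broker.	long	50	[0,...]	low
retries	The maximum number of times to retry a call before failing it.	int	5	[0,...]	low
retry.backoff.ms	The amount of time to wait before attempting to retry a failed request. This avoids repeatedly sending requests in a tight loop under some failure scenarios.	long	100	[0,...]	low
sasl.kerberos.kinit.cmd	Kerberos kinit command path.	string	/usr/bin/kinit		low
sasl.kerberos.min.time.before.relogin	Login thread sleep time between refresh attempts.	long	60000		low
sasl.kerberos.ticket.renew.jitter	Percentage of random jitter added to the renewal time.	double	0.05		low
sasl.kerberos.ticket.renew.window.factor	Login thread will sleep until the specified window factor of time from last refresh to ticket's expiry has been reached, at which time it will try to renew the ticket.	double	0.8		low
ssl.cipher.suites	A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported.	list	null		low
ssl.endpoint.identification.algorithm	The endpoint identification algorithm to validate server hostname using server certificate.	string	null		low
ssl.keymanager.algorithm	The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine.	string	SunX509		low
ssl.secure.random.implementation	The SecureRandom PRNG implementation to use for SSL cryptography operations.	string	null		low
ssl.trustmanager.algorithm	The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine.	string	PKIX		low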
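The reconnect.backoff.ms and reconnect.backoff.max.ms rows, repeated in each client table above, describe a per-host delay that grows with consecutive connection failures, is capped, and then has 20% random jitter added. A minimal sketch, assuming the delay doubles with each failure and that the jitter is a symmetric ±20% factor; the class name is hypothetical and the clients' exact formula may differ. Because the jitter is applied after the cap, a delay in this sketch can exceed the maximum by up to 20%.

```java
import java.util.Random;

class ReconnectBackoffSketch {

    // Backoff before jitter: baseMs * 2^(failures - 1), capped at maxMs.
    static long baseBackoffMs(int consecutiveFailures, long baseMs, long maxMs) {
        if (consecutiveFailures <= 0) return 0L;
        // Cap the exponent so the shift cannot overflow.
        int exp = Math.min(consecutiveFailures - 1, 30);
        long backoff = baseMs * (1L << exp);
        return Math.min(backoff, maxMs);
    }

    // Applies +/-20% random jitter so that many clients do not reconnect in lockstep.
    static long jitteredBackoffMs(int consecutiveFailures, long baseMs, long maxMs, Random rnd) {
        long backoff = baseBackoffMs(consecutiveFailures, baseMs, maxMs);
        double factor = 0.8 + 0.4 * rnd.nextDouble(); // in [0.8, 1.2)
        return (long) (backoff * factor);
    }
}
```

With the table defaults (50 ms base, 1000 ms maximum), the pre-jitter delays run 50, 100, 200, 400, 800 ms and then stay at 1000 ms.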

Kafka design motivation

We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.

It would have to have high throughput to support high-volume event streams such as real-time log aggregation.

It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.

It also meant the system would have to handle low-latency delivery to handle more traditional messaging use cases.

We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.

Finally, in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault tolerance in the presence of machine failures.

Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.

Kafka persistence

Don't fear the filesystem!

Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow", which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect, depending on how they are used; and a properly designed disk structure can often be as fast as the network.

The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result, the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec, but the performance of random writes is only about 100KB/sec, a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!
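To make the gap concrete, the following sketch uses the two figures quoted above (about 600MB/sec for linear writes and about 100KB/sec for random writes) to estimate how long writing 1GB takes under each pattern. The class name is hypothetical, and decimal units (1MB = 1000KB) are an assumption of the sketch.

```java
class DiskThroughputSketch {

    static final double SEQUENTIAL_KB_PER_SEC = 600_000.0; // ~600MB/sec linear writes
    static final double RANDOM_KB_PER_SEC = 100.0;         // ~100KB/sec random writes

    static double secondsToWrite(double kilobytes, double kbPerSec) {
        return kilobytes / kbPerSec;
    }

    public static void main(String[] args) {
        double oneGbInKb = 1_000_000.0;
        double seq = secondsToWrite(oneGbInKb, SEQUENTIAL_KB_PER_SEC);
        double rnd = secondsToWrite(oneGbInKb, RANDOM_KB_PER_SEC);
        System.out.printf("sequential: %.2f s, random: %.0f s (%.1f h), ratio: %.0fx%n",
                seq, rnd, rnd / 3600.0, rnd / seq);
    }
}
```

With these numbers the sequential write takes about 1.7 seconds and the random one 10,000 seconds (roughly 2.8 hours), the 6000x ratio from the text.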

To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:

The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

As a result of these factors, using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure: we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code, as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads, then read-ahead is effectively pre-populating this cache with useful data on each disk read.

This suggests a design which is very simple: rather than maintain as much as possible in memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
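The "write to the log, let the kernel flush" approach can be sketched with java.nio: writing through a FileChannel opened in append mode, without calling force(), leaves the bytes in the OS pagecache, and the kernel writes them to disk later. The class name, the temporary file and the newline-terminated record layout are hypothetical; Kafka's actual log segment code is considerably more involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class PageCacheAppendSketch {

    // Appends one record to the log file and returns the new file size.
    // No force()/fsync is issued: the bytes land in the kernel pagecache
    // and are written to disk by the OS later.
    static long append(Path log, String record) {
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            return ch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes two records to a temporary log and reads them back.
    static List<String> demo() {
        try {
            Path log = Files.createTempFile("sketch-", ".log");
            try {
                append(log, "first");
                append(log, "second");
                // Reads are typically served from the same pagecache the writes went into.
                return Files.readAllLines(log, StandardCharsets.UTF_8);
            } finally {
                Files.delete(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [first, second]
    }
}
```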

This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).

Constant Time Suffices

The persistent data structures used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time, so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache; i.e., doubling your data makes things much worse than twice as slow.
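The seek arithmetic above can be made concrete. Assuming 10 ms per seek and one seek at a time per disk, a single disk manages at most about 100 random seeks per second, and each uncached level a tree lookup has to touch costs another seek. The helper and the level counts printed are illustrative assumptions.

```java
class SeekCostSketch {

    // Upper bound on random operations per second for one disk, when each operation
    // needs `seeksPerOp` uncached seeks of `seekMs` milliseconds each.
    static double maxOpsPerSecond(double seekMs, int seeksPerOp) {
        return 1000.0 / (seekMs * seeksPerOp);
    }

    public static void main(String[] args) {
        for (int seeks = 1; seeks <= 4; seeks++) {
            System.out.printf("%d seek(s) per op -> at most %.1f ops/s%n",
                    seeks, maxOpsPerSecond(10.0, seeks));
        }
    }
}
```

With one uncached seek per operation the bound is 100 operations per second; with four it drops to 25, regardless of how fast the CPU is.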

Intuitively, a persistent queue could be built on simple reads and appends to files, as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size: one server can now take full advantage of a number of cheap, low-rotational-speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.
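A minimal in-memory sketch of the append-only structure described above: append adds at the tail and returns the new entry's offset, and a read is a direct lookup by offset, so neither depends on how much data is already stored. The class name is hypothetical, and this single-threaded sketch only illustrates the O(1) access pattern; it does not model file-backed storage or the non-blocking concurrent reads of a real log.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for an append-only log: entries are only ever added at the end,
// and each one is addressed by its offset (its position in the sequence).
class AppendOnlyLogSketch {

    private final List<byte[]> entries = new ArrayList<>();

    // Amortized O(1): add at the tail and return the assigned offset.
    long append(byte[] message) {
        entries.add(message.clone());
        return entries.size() - 1;
    }

    // O(1): read by offset; entries are never modified once written.
    byte[] read(long offset) {
        if (offset < 0 || offset >= entries.size()) {
            throw new IndexOutOfBoundsException("no entry at offset " + offset);
        }
        return entries.get((int) offset).clone();
    }

    // The offset the next appended entry will receive.
    long nextOffset() {
        return entries.size();
    }
}
```

A consumer only needs to remember the next offset it wants to read; rereading older data is just a read with a smaller offset.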

Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.
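Time-based retention can be sketched as a check over log segments: a segment becomes deletable once its newest message is older than the retention period, whether or not anyone has consumed it. The Segment class, its fields and the helper are hypothetical simplifications (for example, nothing here protects the segment currently being written); the one-week period mirrors the example in the text, which also matches the broker's default log.retention.hours of 168.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class RetentionSketch {

    // Hypothetical segment descriptor: base offset and timestamp of its newest message.
    static final class Segment {
        final long baseOffset;
        final long lastTimestampMs;

        Segment(long baseOffset, long lastTimestampMs) {
            this.baseOffset = baseOffset;
            this.lastTimestampMs = lastTimestampMs;
        }
    }

    static final long ONE_WEEK_MS = TimeUnit.DAYS.toMillis(7);

    // Returns the base offsets of segments whose newest message is older than the
    // retention period. Consumption state is deliberately not consulted.
    static List<Long> expired(List<Segment> segments, long nowMs, long retentionMs) {
        List<Long> result = new ArrayList<>();
        for (Segment s : segments) {
            if (nowMs - s.lastTimestampMs > retentionMs) {
                result.add(s.baseOffset);
            }
        }
        return result;
    }
}
```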

kafka效率

We have put significant effort into efficiency. One of our primary use cases is handling web activity data, which is very high volume: each page view may generate dozens of writes. Furthermore we assume each message published is read by at least one consumer (often many), hence we strive to make consumption as cheap as possible.
我们已经投入了很多精力到效率中。我们主要使用案例之一处理 web 活动数据，这是非常高的容量：每个页面视图模式下可以产生几十个写入操作。此外，我们假设每个发布的消息至少一名消费者（通常很多），因此我们努力使消费尽可能的廉价。

We have also found, from experience building and running a number of similar systems, that efficiency is a key to effective multi-tenant operations. If the downstream infrastructure service can easily become a bottleneck due to a small bump in usage by the application, such small changes will often create problems. By being very fast we help ensure that the application will tip-over under load before the infrastructure. This is particularly important when trying to run a centralized service that supports dozens or hundreds of applications on a centralized cluster as changes in usage patterns are a near-daily occurrence.
我们还从建设和运行多个类似的系统中发现，有效率的多租户操作是效率的关键。如果底层的基础架构服务由于应用程序少量的效率慢的代码而容易成为瓶颈，这种通常会产生问题。我们非常快速度的处理以确保应用程序在基础设施之前tip-over，从而使得低于负载。当在集中式集群中运行支持数十个或数百个应用程序的集中式服务时，这一点就会特别重要，因为使用模式的变化几乎每天都会发生。

We discussed disk efficiency in the previous section. Once poor disk access patterns have been eliminated, there are two common causes of inefficiency in this type of system: too many small I/O operations, and excessive byte copying.

The small I/O problem happens both between the client and the server and in the server's own persistent operations.

To avoid this, our protocol is built around a "message set" abstraction that naturally groups messages together. This allows network requests to group messages together and amortize the overhead of the network round trip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.

This simple optimization produces orders-of-magnitude speedups. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allow Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.

The other inefficiency is byte copying. At low message rates this is not an issue, but under load the impact is significant. To avoid this, we employ a standardized binary message format that is shared by the producer, the broker, and the consumer, so data chunks can be transferred between them without modification.
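To illustrate why a shared format saves work, here is a sketch with a made-up record layout: a 4-byte big-endian length prefix followed by the payload. This layout is an assumption for illustration, not Kafka's actual record format. The producer serializes once, the broker stores and serves the bytes untouched, and only the consumer parses them:

```python
import struct


def encode_message_set(messages):
    """Serialize messages as [4-byte big-endian length][payload] records."""
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)


def decode_message_set(chunk):
    """Parse a chunk produced by encode_message_set back into messages."""
    messages, pos = [], 0
    while pos < len(chunk):
        (length,) = struct.unpack_from(">I", chunk, pos)
        pos += 4
        messages.append(chunk[pos:pos + length])
        pos += length
    return messages


# The producer encodes once; the broker appends the bytes to its log unchanged
# and later hands the very same bytes to the consumer, which decodes them.
producer_chunk = encode_message_set([b"page_view", b"click"])
broker_log = bytearray()
broker_log += producer_chunk          # stored as-is, no re-serialization
consumer_chunk = bytes(broker_log)    # served as-is
print(decode_message_set(consumer_chunk))  # [b'page_view', b'click']
```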

The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks. Modern Unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the sendfile system call. Java exposes this system call through the FileChannel.transferTo API.

To understand the impact of sendfile, it is important to understand the common data path for transferring data from a file to a socket:

The operating system reads data from the disk into pagecache in kernel space.
The application reads the data from kernel space into a user-space buffer.
The application writes the data back into kernel space, into a socket buffer.
The operating system copies the data from the socket buffer to the NIC buffer, where it is sent over the network.

This is clearly inefficient: there are four copies and two system calls. Using sendfile, this re-copying is avoided by allowing the OS to send the data from pagecache to the network directly. So in this optimized path, only the final copy to the NIC buffer is needed.
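The two data paths can be contrasted with a small Python sketch. Java reaches sendfile through FileChannel.transferTo; here Python's socket.sendfile() plays that role. When os.sendfile() is available it lets the kernel copy from pagecache straight to the socket, and otherwise it falls back to ordinary send() calls. A local socket pair and a temporary file stand in for a consumer connection and a log segment:

```python
import os
import socket
import tempfile


def send_with_copies(path, sock):
    """Traditional path: read() pulls bytes into a user-space buffer, then
    sendall() copies them back into a kernel socket buffer."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(64 * 1024)
            if not buf:
                break
            sock.sendall(buf)


def send_zero_copy(path, sock):
    """Zero-copy path: socket.sendfile() hands the file to the kernel via
    os.sendfile() where supported, skipping the user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def transfer(send_fn, path):
    """Send a file over a local socket pair and return what arrives."""
    sender, receiver = socket.socketpair()
    with sender:
        send_fn(path, sender)
    received = b""
    with receiver:
        while chunk := receiver.recv(4096):
            received += chunk
    return received


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"segment" * 100)  # stands in for a log segment file
try:
    copied = transfer(send_with_copies, tmp.name)
    zero_copied = transfer(send_zero_copy, tmp.name)
finally:
    os.remove(tmp.name)
print(len(copied), copied == zero_copied)  # 700 True
```

Both functions deliver the same bytes; the difference is only in how many times those bytes are copied on the way.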

We expect a common use case to be multiple consumers on a topic. Using the zero-copy optimization above, data is copied into pagecache exactly once and reused on each consumption, instead of being stored in memory and copied out to user space every time it is read. This allows messages to be consumed at a rate that approaches the limit of the network connection.

This combination of pagecache and sendfile means that on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks whatsoever, because they will be serving data entirely from cache.

For more background on sendfile and zero-copy support in Java, see this article.

End-to-end batch compression

In some cases the bottleneck is actually not CPU or disk but network bandwidth. This is particularly true for a data pipeline that needs to send messages between data centers over a wide-area network. Of course, the user can always compress messages one at a time without any support from Kafka, but this can lead to very poor compression ratios, because much of the redundancy is due to repetition between messages of the same type (e.g. field names in JSON, user agents in web logs, or common string values). Efficient compression requires compressing multiple messages together rather than compressing each message individually.

Kafka supports this by allowing recursive message sets. A batch of messages can be clumped together, compressed, and sent to the server in this form. This batch of messages will be written in compressed form, will remain compressed in the log, and will only be decompressed by the consumer.
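The effect of compressing a batch rather than individual messages is easy to see with Python's standard gzip module (GZIP being one of the codecs Kafka supports). The JSON records below are made up for illustration; exact sizes will vary, but the batch compresses far better because repeated field names and the user-agent string can be replaced by back-references to earlier records:

```python
import gzip
import json

# Made-up web-log records: field names and the user agent repeat in every one.
records = [
    json.dumps({
        "user_id": i,
        "page": "/home",
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    }).encode()
    for i in range(200)
]

# Compressing each message on its own: what a client can do without Kafka's help.
individually = sum(len(gzip.compress(record)) for record in records)

# Compressing the whole batch as one unit, as with a compressed message set.
batch = b"\n".join(records)
compressed_batch = gzip.compress(batch)
as_batch = len(compressed_batch)

# Only the consumer decompresses; it gets the original records back.
restored = gzip.decompress(compressed_batch).split(b"\n")
print(f"individually: {individually} bytes, as one batch: {as_batch} bytes")
```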

Kafka supports the GZIP and Snappy compression protocols. More details on compression can be found here: https://cwiki.apache.org/confluence/display/KAFKA/Compression

Kafka producer

Load balancing

The producer sends data directly to the broker that is the leader for the partition, without any intervening routing tier. To help the producer do this, all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time, allowing the producer to direct its requests appropriately.

The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or by some semantic partitioning function. We expose the interface for semantic partitioning by allowing the user to specify a key to partition by and hashing that key to a partition (there is also an option to override the partition function if need be). For example, if the key chosen was a user ID, then all data for a given user would be sent to the same partition. This in turn allows consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers.
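A sketch of both strategies with a hypothetical `choose_partition` helper: keyed records are hashed so the same key always lands on the same partition, and records without a key are spread at random. CRC32 is used here only because it is stable across processes (Python's built-in `hash()` for bytes is randomized per process); Kafka's Java client uses its own murmur2-based hash, so these partition numbers will not match a real cluster:

```python
import random
import zlib


def choose_partition(key, num_partitions, rng=random):
    """Pick a partition for a record (hypothetical helper).

    key is None -> random choice, i.e. simple random load balancing.
    otherwise   -> stable hash of the key, so equal keys share a partition.
    """
    if key is None:
        return rng.randrange(num_partitions)
    return zlib.crc32(key) % num_partitions


events = [(b"user-42", "login"), (b"user-7", "view"),
          (b"user-42", "logout"), (None, "ping")]
for key, event in events:
    print(key, event, "-> partition", choose_partition(key, 6))
```

Both events for `user-42` go to the same partition, so a single consumer sees that user's activity in order.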

Asynchronous send

Batching is one of the big drivers of efficiency, and to enable batching, the Kafka producer will attempt to accumulate data in memory and send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed amount of data and to wait no longer than some fixed latency bound (say, 64 KB or 10 ms). This allows more bytes to accumulate per send and results in fewer, larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
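A simplified single-partition sketch of such an accumulator, using a hypothetical `BatchAccumulator` class with an injectable clock so the example is deterministic. Kafka's Java producer exposes comparable bounds through its `batch.size` (bytes) and `linger.ms` settings, but this is not its implementation:

```python
import time


class BatchAccumulator:
    """Hypothetical accumulator: a batch is released when it reaches
    batch_size bytes or when its oldest record has waited linger_seconds,
    whichever happens first."""

    def __init__(self, batch_size=64 * 1024, linger_seconds=0.010, clock=time.monotonic):
        self.batch_size = batch_size
        self.linger_seconds = linger_seconds
        self.clock = clock
        self._records = []
        self._bytes = 0
        self._opened_at = None

    def append(self, record: bytes):
        """Buffer a record; return a ready batch, or None if still waiting."""
        if not self._records:
            self._opened_at = self.clock()  # the linger timer starts here
        self._records.append(record)
        self._bytes += len(record)
        return self.poll()

    def poll(self):
        """Release the buffered batch if either bound has been reached."""
        if not self._records:
            return None
        full = self._bytes >= self.batch_size
        lingered = self.clock() - self._opened_at >= self.linger_seconds
        if not (full or lingered):
            return None
        batch, self._records, self._bytes = self._records, [], 0
        return batch


now = [0.0]  # fake clock
acc = BatchAccumulator(batch_size=10, linger_seconds=0.010, clock=lambda: now[0])
r1 = acc.append(b"abcd")    # 4 bytes, no time has passed
r2 = acc.append(b"efghij")  # 10 bytes: size bound reached
r3 = acc.append(b"k")       # starts a new batch
now[0] += 0.011             # 11 ms pass with no further records
r4 = acc.poll()             # linger bound reached
print(r1, r2, r3, r4)  # None [b'abcd', b'efghij'] None [b'k']
```

Raising `linger_seconds` lets more records share one request at the cost of a little extra latency for the first record in each batch.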

Details on configuration and the API for the producer can be found elsewhere in the documentation.

Kafka consumer

The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.

Push vs. pull

An initial question we considered is whether consumers should pull data from brokers or brokers should push data to the consumer. In this respect Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. Some logging-centric systems, such as Scribe and Apache Flume, follow a very different push-based path where data is pushed downstream. There are pros and cons to both approaches. However, a push-based system has difficulty dealing with diverse consumers, because the broker controls the rate at which data is transferred. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately, in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (in essence, a denial-of-service attack). A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. A push system can mitigate the overload problem with some kind of backoff protocol by which the consumer indicates that it is overwhelmed, but getting the transfer rate to fully utilize (but never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with a more traditional pull model.

Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. A push-based system must choose either to send a request immediately or to accumulate more data and send it later, without knowing whether the downstream consumer will be able to process it immediately. If tuned for low latency, this results in sending a single message at a time only for the transfer to end up being buffered anyway, which is wasteful. A pull-based design fixes this because the consumer always pulls all available messages after its current position in the log (or up to some configurable max size). So one gets optimal batching without introducing unnecessary latency.

The deficiency of a naive pull-based system is that if the broker has no data, the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this, we have parameters in our pull request that allow the consumer request to block in a "long poll", waiting until data arrives (and optionally waiting until a given number of bytes is available, to ensure large transfer sizes).
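A sketch of the long-poll idea using a condition variable, where a hypothetical in-process `LongPollPartition` stands in for the broker. Its `min_bytes` and `max_wait` parameters play roughly the role of the consumer's `fetch.min.bytes` and `fetch.max.wait.ms` settings; the consumer sleeps instead of spinning, and wakes as soon as enough data arrives:

```python
import threading
import time


class LongPollPartition:
    """Hypothetical partition whose fetch() blocks until data is available."""

    def __init__(self):
        self._messages = []
        self._cond = threading.Condition()

    def append(self, message: bytes):
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()  # wake any fetchers waiting for data

    def fetch(self, offset, min_bytes=1, max_wait=0.5):
        """Return messages from offset once at least min_bytes are available,
        or whatever is there (possibly nothing) after max_wait seconds."""
        deadline = time.monotonic() + max_wait
        with self._cond:
            while True:
                available = self._messages[offset:]
                if sum(map(len, available)) >= min_bytes:
                    return available
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return available
                self._cond.wait(remaining)  # sleep, not busy-wait


partition = LongPollPartition()
empty = partition.fetch(0, max_wait=0.05)  # nothing produced: times out
threading.Timer(0.05, partition.append, args=(b"hello",)).start()
got = partition.fetch(0, max_wait=2.0)     # returns as soon as data arrives
print(empty, got)  # [] [b'hello']
```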

You could imagine other possible designs that would be only pull, end to end. The producer would write to a local log, and brokers would pull from that, with consumers pulling from them. A similar type of "store-and-forward" producer is often proposed. This is intriguing, but we felt it was not very suitable for our target use cases, which have thousands of producers. Our experience running persistent data systems at scale led us to feel that involving thousands of disks in the system across many applications would not actually make things more reliable and would be a nightmare to operate. And in practice we have found that we can run a pipeline with strong SLAs at large scale without a need for producer persistence.

Consumer position

Keeping track of what has been consumed is, surprisingly, one of the key performance points of a messaging system.

Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker either records that fact locally immediately or waits for acknowledgement from the consumer. This is a fairly intuitive choice, and indeed for a single-machine server it is not clear where else this state could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice: since the broker knows what is consumed, it can immediately delete it, keeping the data size small.

What is perhaps not obvious is that getting the broker and consumer to agree about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say, because it crashes or the request times out), that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature, which means that messages are only marked as sent, not consumed, when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First, if the consumer processes the message but fails before it can send an acknowledgement, then the message will be consumed twice. The second problem is performance: the broker must now keep multiple states about every single message (first locking it so it is not given out a second time, and then marking it as permanently consumed so that it can be removed). Tricky problems must also be dealt with, such as what to do with messages that are sent but never acknowledged.

Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time. This means that the position of a consumer in each partition is just a single integer: the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed, which makes the equivalent of message acknowledgements very cheap.

There is a side benefit of this decision. A consumer can deliberately rewind to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if the consumer code has a bug that is discovered after some messages are consumed, the consumer can re-consume those messages once the bug is fixed.
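Because the whole state is one integer per partition, rewinding is just overwriting that integer. A sketch with a hypothetical `ConsumerPosition` class (in practice the Java consumer's `seek()` method serves this purpose):

```python
class ConsumerPosition:
    """Hypothetical consumer state: one integer per partition, the offset of
    the next message to read."""

    def __init__(self):
        self.positions = {}

    def poll(self, partition, log, max_messages=10):
        """Read from the current position and advance it past what was read."""
        offset = self.positions.get(partition, 0)
        batch = log[offset:offset + max_messages]
        self.positions[partition] = offset + len(batch)
        return batch

    def seek(self, partition, offset):
        """Rewind (or skip ahead) by overwriting the stored integer."""
        self.positions[partition] = offset


partition_log = [b"m0", b"m1", b"m2", b"m3"]
consumer = ConsumerPosition()
first = consumer.poll(0, partition_log)     # reads m0..m3, position becomes 4
consumer.seek(0, 1)                         # bug fixed: go back to offset 1
replayed = consumer.poll(0, partition_log)  # m1..m3 are delivered again
print(first, replayed)
# [b'm0', b'm1', b'm2', b'm3'] [b'm1', b'm2', b'm3']
```

The log itself is untouched by either call, so other consumers reading the same partition are unaffected by the rewind.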

Offline data load

Scalable persistence allows for consumers that only consume periodically, such as batch data loads that periodically bulk-load data into an offline system such as Hadoop or a relational data warehouse.

In the case of Hadoop, we parallelize the data load by splitting the load over individual map tasks, one for each node/topic/partition combination, allowing full parallelism in the loading. Hadoop provides the task management, and tasks that fail can restart without danger of duplicate data: they simply restart from their original position.

Message delivery semantics

Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:

At most once: messages may be lost but are never redelivered.
At least once: messages are never lost but may be redelivered.
Exactly once: this is what people actually want; each message is delivered once and only once.

It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.

Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print: most of these claims are misleading (i.e. they don't account for cases where consumers or producers can fail, where there are multiple consumer processes, or where data written to disk can be lost).

Kafka's semantics are straightforward. When publishing a message, we have a notion of the message being "committed" to the log. Once a published message is committed, it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of alive, as well as a description of which types of failures we attempt to handle, will be given in more detail in the next section. For now, let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error, it cannot be sure whether this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.

Prior to 0.11.0.0, if a producer failed to receive a response indicating that a message was committed, it had little choice but to resend the message. This provides at-least-once delivery semantics, since the message may be written to the log again during resending if the original request had in fact succeeded. Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option, which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports sending messages to multiple topic partitions using transaction-like semantics: either all messages are successfully written or none of them are. The main use case for this is exactly-once processing between Kafka topics (described below).
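A broker-side sketch of that deduplication, assuming a single partition and a hypothetical `IdempotentPartitionLog` class. Real brokers track more state per producer (the Java producer turns this behavior on with `enable.idempotence=true`); the point here is only how a producer ID plus a sequence number turns a retry into a no-op:

```python
class IdempotentPartitionLog:
    """Hypothetical broker-side log that drops retried records."""

    def __init__(self):
        self.records = []
        self._next_seq = {}  # producer_id -> sequence number expected next

    def append(self, producer_id, sequence, value):
        """Return True if appended, False if this was a duplicate retry."""
        expected = self._next_seq.get(producer_id, 0)
        if sequence < expected:
            return False  # already written: acknowledge, but do not re-append
        if sequence > expected:
            raise ValueError("sequence gap: an earlier record is missing")
        self.records.append(value)
        self._next_seq[producer_id] = sequence + 1
        return True


broker_log = IdempotentPartitionLog()
broker_log.append(producer_id=1, sequence=0, value=b"order-created")
# The acknowledgement was lost on the network, so the producer resends seq 0.
broker_log.append(producer_id=1, sequence=0, value=b"order-created")
broker_log.append(producer_id=1, sequence=1, value=b"order-paid")
print(broker_log.records)  # [b'order-created', b'order-paid']
```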

Not all use cases require such strong guarantees. For uses that are latency sensitive, we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait for the message to be committed, this can take on the order of 10 ms. However, the producer can also specify that it wants to perform the send completely asynchronously, or that it wants to wait only until the leader (but not necessarily the followers) has the message.

Now let's describe the semantics from the point of view of the consumer. All replicas have exactly the same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed, it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process, the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages; it has several options for processing the messages and updating its position.

It can read the messages, then save its position in the log, and finally process the messages. In this case, the consumer process may crash after saving its position but before saving the output of its message processing. The process that takes over would then start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics, since in the case of a consumer failure, messages may not be processed.
It can read the messages, process the messages, and finally save its position. In this case, the consumer process may crash after processing messages but before saving its position. When the new process takes over, the first few messages it receives will already have been processed. This corresponds to "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key, so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
So what about exactly-once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value, and the produced data on the output topics will not be visible to other consumers, depending on their "isolation level". In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed", the consumer will only return messages from transactions that were committed (and any messages that were not part of a transaction).
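The first two options differ only in the order of "save position" and "process". A small simulation makes the consequence visible: a hypothetical consumer crashes while handling message m1, between its two steps, and a new process resumes from the saved offset. The `Crash` exception and the dict standing in for durable storage are illustrative assumptions:

```python
class Crash(Exception):
    """Simulated consumer process failure."""


def consume(log, state, commit_first, crash_at=None):
    """Consume log from state["offset"], appending results to state["output"].

    commit_first=True  -> save position, then process (at-most-once)
    commit_first=False -> process, then save position (at-least-once)
    crash_at: offset of the message during which the process dies,
    after the first step and before the second.
    """
    for offset in range(state["offset"], len(log)):
        if commit_first:
            state["offset"] = offset + 1          # 1. save position
            if offset == crash_at:
                raise Crash                        # dies before processing
            state["output"].append(log[offset])   # 2. process
        else:
            state["output"].append(log[offset])   # 1. process
            if offset == crash_at:
                raise Crash                        # dies before saving position
            state["offset"] = offset + 1          # 2. save position


def run(commit_first):
    log = ["m0", "m1", "m2"]
    state = {"offset": 0, "output": []}            # survives the crash
    try:
        consume(log, state, commit_first, crash_at=1)
    except Crash:
        pass                                       # a new process takes over
    consume(log, state, commit_first)              # resume from saved offset
    return state["output"]


print(run(commit_first=True))   # ['m0', 'm2']: m1 lost (at-most-once)
print(run(commit_first=False))  # ['m0', 'm1', 'm1', 'm2']: m1 twice (at-least-once)
```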

When writing to an external system, the limitation is the need to coordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage of the consumer position and the storage of the consumer's output. But this can be handled more simply and generally by letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example, consider a Kafka Connect connector that populates data in HDFS along with the offsets of the data it reads, so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems that require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
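The same idea can be sketched with any store that has local transactions. Below, an in-memory SQLite database (an assumption standing in for HDFS or another sink; the table and function names are hypothetical) holds both the processed output and the next offset, and both are written in one transaction, so a crash mid-batch leaves neither updated:

```python
import sqlite3


def open_store():
    """Create a hypothetical sink holding results and the consumer offset."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE results (item TEXT)")
    db.execute("CREATE TABLE consumer_offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")
    db.execute("INSERT INTO consumer_offsets VALUES (0, 0)")
    db.commit()
    return db


def process_batch(db, log, batch_size=2, crash=False):
    """Process the next batch, writing output and offset in ONE transaction."""
    (start,) = db.execute(
        "SELECT next_offset FROM consumer_offsets WHERE part = 0").fetchone()
    batch = log[start:start + batch_size]
    try:
        with db:  # commits if the block succeeds, rolls back if it raises
            for message in batch:
                db.execute("INSERT INTO results VALUES (?)", (message.upper(),))
            if crash:
                raise RuntimeError("simulated crash before the offset is saved")
            db.execute("UPDATE consumer_offsets SET next_offset = ? WHERE part = 0",
                       (start + len(batch),))
    except RuntimeError:
        pass  # rolled back: neither the partial output nor an offset survives


log = ["a", "b", "c", "d"]
db = open_store()
process_batch(db, log)              # writes A, B and offset 2 together
process_batch(db, log, crash=True)  # C, D rolled back; offset still 2
process_batch(db, log)              # the retry writes C, D and offset 4
rows = [item for (item,) in db.execute("SELECT item FROM results ORDER BY rowid")]
print(rows)  # ['A', 'B', 'C', 'D']
```

No two-phase commit is needed because output and position live in the same system and share its transaction.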

So effectively, Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system, but Kafka provides the offset, which makes implementing this straightforward.

Replication and leader election

Replication

Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails, so messages remain available in the presence of failures.

Other messaging systems provide some replication-related features, but, in our (totally biased) opinion, this appears to be a tacked-on thing, not heavily used, and with large downsides: slaves are inactive, throughput is heavily impacted, it requires fiddly manual configuration, etc. Kafka is meant to be used with replication by default; in fact, we implement un-replicated topics as replicated topics where the replication factor is one.

The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers. The total number of replicas, including the leader, constitutes the replication factor. All reads and writes go to the leader of the partition. Typically, there are many more partitions than brokers, and the leaders are evenly distributed among brokers. The logs on the followers are identical to the leader's log: all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet unreplicated messages at the end of its log).

Followers consume messages from the leader just as a normal Kafka consumer would and apply them to their own log. Having the followers pull from the leader has the nice property of allowing a follower to naturally batch together the log entries it is applying to its log.

As with most distributed systems, automatically handling failures requires a precise definition of what it means for a node to be "alive". For Kafka, node liveness has two conditions:

A node must be able to maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism).
If it is a follower, it must replicate the writes happening on the leader and not fall "too far" behind.

We refer to nodes satisfying these two conditions as being "in sync" to avoid the vagueness of "alive" or "failed". The leader keeps track of the set of "in sync" nodes. If a follower dies, gets stuck, or falls behind, the leader will remove it from the list of in-sync replicas. How far behind is too far is controlled by the replica.lag.max.messages configuration, and what counts as a stuck replica is controlled by the replica.lag.time.max.ms configuration.
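A sketch of the leader's bookkeeping with a hypothetical `in_sync_replicas` function. The two thresholds correspond to replica.lag.max.messages and replica.lag.time.max.ms, but the numeric values used here are only illustrative. The leader itself is always in sync and is left out of the result:

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    log_end_offset: int     # next offset this follower will fetch
    last_caught_up_ms: int  # last time it had everything the leader had


def in_sync_replicas(leader_end_offset, followers, now_ms,
                     lag_max_messages=4000, lag_time_max_ms=10_000):
    """Return the ids of followers that are neither too far behind nor stuck."""
    isr = []
    for replica_id, state in followers.items():
        too_far_behind = leader_end_offset - state.log_end_offset > lag_max_messages
        stuck = now_ms - state.last_caught_up_ms > lag_time_max_ms
        if not (too_far_behind or stuck):
            isr.append(replica_id)
    return sorted(isr)


followers = {
    2: FollowerState(log_end_offset=10_000, last_caught_up_ms=99_000),  # healthy
    3: FollowerState(log_end_offset=1_000, last_caught_up_ms=99_500),   # 9000 behind
    4: FollowerState(log_end_offset=10_000, last_caught_up_ms=80_000),  # stuck for 20 s
}
print(in_sync_replicas(10_000, followers, now_ms=100_000))  # [2]
```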

In distributed systems terminology, we only attempt to handle a "fail/recover" model of failures, where nodes suddenly cease working and then later recover (perhaps without knowing that they have died). Kafka does not handle so-called "Byzantine" failures, in which nodes produce arbitrary or malicious responses (perhaps due to bugs or foul play).

We can now more precisely define that a message is considered committed when all in-sync replicas for that partition have applied it to their log. Only committed messages are ever given out to the consumer. This means that the consumer need not worry about potentially seeing a message that could be lost if the leader fails. Producers, on the other hand, have the option of either waiting for the message to be committed or not, depending on their preference for the tradeoff between latency and durability. This preference is controlled by the acks setting that the producer uses. Note that topics have a setting for the "minimum number" of in-sync replicas that is checked when the producer requests acknowledgment that a message has been written to the full set of in-sync replicas. If a less stringent acknowledgement is requested by the producer, then the message can be committed and consumed even if the number of in-sync replicas is lower than the minimum (e.g. it can be as low as just the leader).
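A sketch of that rule with a hypothetical `PartitionReplicas` class: a message counts as committed once every in-sync replica has it (Kafka calls this boundary the high watermark), and a producer using acks="all" is refused when the ISR has shrunk below the topic's min.insync.replicas. Replication is modelled as an instant copy, which is a simplification:

```python
class PartitionReplicas:
    """Hypothetical leader-side view of commit visibility."""

    def __init__(self, replica_ids, leader, min_insync_replicas):
        self.leader = leader
        self.log_end = {r: 0 for r in replica_ids}  # next offset on each replica
        self.isr = set(replica_ids)
        self.min_insync_replicas = min_insync_replicas

    def produce(self, acks):
        """Append one message on the leader.

        With acks="all" the write is refused when the ISR is smaller than
        min_insync_replicas. (A real leader would also delay the ack until
        every ISR member has the message; that wait is not modelled here.)
        """
        if acks == "all" and len(self.isr) < self.min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas")
        self.log_end[self.leader] += 1

    def replicate(self, follower):
        """The follower fetches everything the leader currently has."""
        self.log_end[follower] = self.log_end[self.leader]

    def high_watermark(self):
        """Offsets below this are on every in-sync replica: committed."""
        return min(self.log_end[r] for r in self.isr)


p = PartitionReplicas(replica_ids=[1, 2, 3], leader=1, min_insync_replicas=2)
p.produce(acks="all")
hw_before = p.high_watermark()       # 0: followers 2 and 3 do not have it yet
p.replicate(2)
p.replicate(3)
hw_after = p.high_watermark()        # 1: on every in-sync replica, committed

p.isr = {1}                          # both followers dropped out of sync
try:
    p.produce(acks="all")
    rejected = False
except RuntimeError:
    rejected = True                  # refused: ISR size 1 < 2
p.produce(acks="1")                  # a weaker ack is still accepted
hw_leader_only = p.high_watermark()  # 2: committed with only the leader
print(hw_before, hw_after, rejected, hw_leader_only)  # 0 1 True 2
```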

The guarantee that Kafka offers is that a committed message will not be lost, as long as there is at least one in-sync replica alive at all times.

Kafka will remain available in the presence of node failures after a short fail-over period, but may not remain available in the presence of network partitions.

Replicated logs: quorums, ISRs, and state machines

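Quorum: originally, the number of members of a legislative body (generally more than half) who must be present for it to conduct business and make decisions.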

At its heart, a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the state-machine style.

A replicated log models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). There are many ways to implement this, but the simplest and fastest is with a leader who chooses the ordering of values provided to it. As long as the leader remains alive, all followers need only copy the values and the ordering the leader chooses.

Of course if leaders didn't fail we wouldn't need followers! When the leader does die we need to choose a new leader from among the followers. But followers themselves may fall behind or crash so we must ensure we choose an up-to-date follower. The fundamental guarantee a log replication algorithm must provide is that if we tell the client a message is committed, and the leader fails, the new leader we elect must also have that message. This yields a tradeoff: if the leader waits for more followers to acknowledge a message before declaring it committed then there will be more potentially electable leaders.

If you choose the number of acknowledgements required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap, then this is called a Quorum.

A common approach to this tradeoff is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's explore it anyway to understand the tradeoffs. Let's say we have 2f+1 replicas. Suppose f+1 replicas must receive a message before the leader declares it committed. Suppose also that we elect a new leader by choosing the follower with the most complete log from at least f+1 replicas. Then, with no more than f failures, the leader is guaranteed to have all committed messages. This is because among any f+1 replicas, there must be at least one replica that contains all committed messages. That replica's log will be the most complete and therefore will be selected as the new leader. There are many remaining details that each algorithm must handle, but we will ignore these for now. Examples include precisely defining what makes a log more complete, ensuring log consistency during leader failure, and changing the set of servers in the replica set.

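The overlap argument can be written out explicitly. In this derivation N is the number of replicas. Q_W is the set of replicas that must receive a message before it is committed, with W = |Q_W|. Q_R is the set of replicas whose logs are compared during leader election, with R = |Q_R|. These symbols are introduced only here.

```latex
\begin{aligned}
\text{Majority vote:}\quad & N = 2f + 1,\qquad W = R = f + 1,\\
& W + R = 2f + 2 > N \;\Longrightarrow\; |Q_W \cap Q_R| \ge W + R - N = 1.\\[4pt]
\text{ISR:}\quad & N = f + 1,\qquad W = f + 1\ (\text{every ISR member}),\qquad R = 1,\\
& W + R = f + 2 > N \;\Longrightarrow\; |Q_W \cap Q_R| \ge 1.
\end{aligned}
```

For f = 1 this gives 3 replicas under majority vote versus 2 under the ISR approach. For f = 2 it gives 5 versus 3. In the ISR case a single surviving ISR member suffices for election, because every ISR member holds every committed message.
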
This majority vote approach has a very nice property: the latency depends only on the fastest servers. That is, if the replication factor is three, the latency is determined by the faster follower, not the slower one.

There are a rich variety of algorithms in this family including ZooKeeper's Zab, Raft, and Viewstamped Replication. The most similar academic publication we are aware of to Kafka's actual implementation is PacificA from Microsoft.

The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. To tolerate one failure requires three copies of the data, and to tolerate two failures requires five copies of the data. In our experience having only enough redundancy to tolerate a single failure is not enough for a practical system, but doing every write five times, with 5x the disk space requirements and 1/5th the throughput, is not very practical for large volume data problems. This is likely why quorum algorithms more commonly appear for shared cluster configuration such as ZooKeeper but are less common for primary data storage. For example in HDFS the namenode's high-availability feature is built on a majority-vote-based journal, but this more expensive approach is not used for the data itself.

Kafka takes a slightly different approach to choosing its quorum set. Instead of majority vote, Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write. This ISR set is persisted to ZooKeeper whenever it changes. Because of this, any replica in the ISR is eligible to be elected leader. This is an important factor for Kafka's usage model where there are many partitions and ensuring leadership balance is important. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.

For most use cases we hope to handle, we think this tradeoff is a reasonable one. In practice, to tolerate f failures, both the majority vote and the ISR approach will wait for the same number of replicas to acknowledge before committing a message (e.g. to survive one failure a majority quorum needs three replicas and one acknowledgement and the ISR approach requires two replicas and one acknowledgement). The ability to commit without the slowest servers is an advantage of the majority vote approach. However, we think it is ameliorated by allowing the client to choose whether they block on the message commit or not, and the additional throughput and disk space due to the lower required replication factor is worth it.

Another important design distinction is that Kafka does not require that crashed nodes recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems and they often do not leave data intact. Secondly, even if this were not a problem, we do not want to require the use of fsync on every write for our consistency guarantees as this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining, it must fully re-sync again even if it lost unflushed data in its crash.

Unclean Leader Election: What If They All Die?

Note that Kafka's guarantee with respect to data loss is predicated on at least one replica remaining in sync. If all the nodes replicating a partition die, this guarantee no longer holds.

However a practical system needs to do something reasonable when all the replicas die. If you are unlucky enough to have this occur, it is important to consider what will happen. There are two behaviors that could be implemented:

Wait for a replica in the ISR to come back to life and choose this replica as the leader (hopefully it still has all its data).
Choose the first replica (not necessarily in the ISR) that comes back to life as the leader.

This is a simple tradeoff between availability and consistency. If we wait for replicas in the ISR, then we will remain unavailable as long as those replicas are down. If such replicas were destroyed or their data was lost, then we are permanently down. If, on the other hand, a non-in-sync replica comes back to life and we allow it to become leader, then its log becomes the source of truth even though it is not guaranteed to have every committed message. In the release this text describes, we choose the second strategy and favor choosing a potentially inconsistent replica when all replicas in the ISR are dead. This behavior can be disabled using the configuration property unclean.leader.election.enable, to support use cases where downtime is preferable to inconsistency. Starting with Kafka 0.11.0, unclean.leader.election.enable defaults to false, so the first strategy is the default.

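The two recovery strategies can be sketched as a single election function. This is an illustrative model only. `LeaderChoice.elect` and its parameters are hypothetical, and the `uncleanEnabled` flag stands in for unclean.leader.election.enable. "First replica to come back to life" is approximated by the first live replica in assignment order rather than by actual arrival time.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative model of leader election when the ISR may be entirely down.
// LeaderChoice is hypothetical; uncleanEnabled stands in for unclean.leader.election.enable.
final class LeaderChoice {
    private LeaderChoice() {}

    static Optional<String> elect(List<String> assignedReplicas, Set<String> isr,
                                  Set<String> alive, boolean uncleanEnabled) {
        // Clean election: a live ISR member is guaranteed to hold every committed message.
        for (String replica : assignedReplicas) {
            if (isr.contains(replica) && alive.contains(replica)) {
                return Optional.of(replica);
            }
        }
        // Unclean election: any live replica may lead, at the risk of losing committed messages.
        if (uncleanEnabled) {
            for (String replica : assignedReplicas) {
                if (alive.contains(replica)) {
                    return Optional.of(replica);
                }
            }
        }
        // Otherwise the partition stays offline until an ISR member returns.
        return Optional.empty();
    }
}
```

Suppose the ISR is {b1}. If b1 is alive, it wins. If only b3 is alive, a clean election leaves the partition offline. An unclean election instead makes b3 the leader, and any committed messages that only b1 held are lost.
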
This dilemma is not specific to Kafka. It exists in any quorum-based scheme. For example in a majority voting scheme, if a majority of servers suffer a permanent failure, then you must either choose to lose 100% of your data or violate consistency by taking what remains on an existing server as your new source of truth.

Availability and Durability Guarantees

When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0,1 or all (-1) replicas. Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message. By default, when acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify acks=all will succeed. However, these writes could be lost if the remaining replica also fails. Although this ensures maximum availability of the partition, this behavior may be undesirable to some users who prefer durability over availability. Therefore, we provide two topic-level configurations that can be used to prefer message durability over availability:

Disable unclean leader election - if all replicas become unavailable, then the partition will remain unavailable until the most recent leader becomes available again. This effectively prefers unavailability over the risk of message loss. See the previous section on Unclean Leader Election for clarification.

Specify a minimum ISR size - the partition will only accept writes if the size of the ISR is at least a certain minimum (the topic-level min.insync.replicas setting). This prevents the loss of messages that were written to just a single replica that subsequently becomes unavailable. This setting only takes effect if the producer uses acks=all (acks=-1), and it guarantees that the message will be acknowledged by at least this many in-sync replicas. This setting offers a trade-off between consistency and availability. A higher minimum ISR size gives better consistency: the message is guaranteed to be written to more replicas, which reduces the probability that it will be lost. However, it reduces availability, since the partition will be unavailable for writes if the number of in-sync replicas drops below the minimum threshold.

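The interaction between acks and the minimum ISR size reduces to one check made when a write arrives. This is a minimal sketch built on a hypothetical `WriteGate` class. `minInsync` plays the role of the topic's min.insync.replicas, and `acks` uses the 0 / 1 / -1 encoding from the text.

```java
// Sketch of the write-time check combining acks with a minimum ISR size.
// WriteGate is hypothetical; minInsync plays the role of min.insync.replicas.
final class WriteGate {
    enum Result { ACCEPTED, REJECTED_NOT_ENOUGH_REPLICAS }

    private WriteGate() {}

    /**
     * @param acks      0, 1, or -1 (all)
     * @param isrSize   current number of in-sync replicas, leader included
     * @param minInsync required minimum ISR size for acks=all writes
     */
    static Result check(int acks, int isrSize, int minInsync) {
        // The minimum only applies when the producer asks for acks=all (-1).
        if (acks == -1 && isrSize < minInsync) {
            return Result.REJECTED_NOT_ENOUGH_REPLICAS;
        }
        return Result.ACCEPTED;
    }
}
```

This reproduces the two-replica example above. With one of the two replicas failed, an acks=all write succeeds when the minimum is 1 but is rejected when it is 2. Writes with acks=1 keep succeeding either way.
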
Replica Management

The above discussion on replicated logs really covers only a single log, i.e. one topic partition. However a Kafka cluster will manage hundreds or thousands of these partitions. We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.

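The round-robin idea can be illustrated with a simplified placement function. This is a sketch, not Kafka's assignment code, which differs in details such as how it picks the starting broker. `RoundRobinAssignment` is a hypothetical name, and the first broker in each replica list is treated as the preferred leader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified round-robin replica placement (not Kafka's exact algorithm).
// RoundRobinAssignment is hypothetical; index 0 of each list is the preferred leader.
final class RoundRobinAssignment {
    private RoundRobinAssignment() {}

    static List<List<Integer>> assign(int numPartitions, int replicationFactor, int numBrokers) {
        if (replicationFactor > numBrokers) {
            throw new IllegalArgumentException("replication factor exceeds broker count");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> brokers = new ArrayList<>();
            // Start each partition on the next broker and place followers on the brokers after it.
            for (int r = 0; r < replicationFactor; r++) {
                brokers.add((p + r) % numBrokers);
            }
            assignment.add(brokers);
        }
        return assignment;
    }

    /** Number of partitions each broker leads, counting the first replica as leader. */
    static int[] leaderCounts(List<List<Integer>> assignment, int numBrokers) {
        int[] counts = new int[numBrokers];
        for (List<Integer> brokers : assignment) {
            counts[brokers.get(0)]++;
        }
        return counts;
    }
}
```

Take 6 partitions, replication factor 3, and 3 brokers. Partition 1 is placed on brokers 1, 2, 0, and each broker leads exactly 2 partitions, so leadership load is spread evenly.
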
It is also important to optimize the leadership election process as that is the critical window of unavailability. A naive implementation of leader election would end up running an election per partition for all partitions a node hosted when that node failed. Instead, we elect one of the brokers as the "controller". This controller detects failures at the broker level and is responsible for changing the leader of all affected partitions in a failed broker. The result is that we are able to batch together many of the required leadership change notifications which makes the election process far cheaper and faster for a large number of partitions. If the controller fails, one of the surviving brokers will become the new controller.

Kafka Log Compaction

Log Compaction

Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance. Let's dive into these use cases in more detail and then describe how compaction works.

So far we have described only the simpler approach to data retention where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for temporal event data such as logging where each record stands alone. However an important class of data streams are the log of changes to keyed, mutable data (for example, the changes to a database table).

Let's discuss a concrete example of such a stream. Say we have a topic containing user email addresses; every time a user updates their email address we send a message to this topic using their user id as the primary key. Now say we send the following messages over some time period for a user with id 123, each message corresponding to a change in email address (messages for other ids are omitted):

    123 => bill@microsoft.com
            .
            .
            .
    123 => bill@gatesfoundation.org
            .
            .
            .
    123 => bill@gmail.com

Log compaction gives us a more granular retention mechanism, so we are guaranteed to retain at least the last update for each primary key (e.g. bill@gmail.com). This guarantees that the log contains a full snapshot of the final value for every key, not just for keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.

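The retention rule for the email example can be simulated in memory. This sketch is hypothetical: `ToyCompaction` is not a Kafka class, and user 456 is an invented second key. It keeps the last record for each key and preserves each survivor's original offset and relative order. One pass finds the latest offset per key, and a second pass copies the survivors. It assumes the input list is in offset order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy, in-memory compaction: keep the last record per key, preserving original
// offsets and relative order. ToyCompaction is a hypothetical name.
final class ToyCompaction {
    private ToyCompaction() {}

    static final class Entry {
        final long offset;
        final String key;
        final String value;

        Entry(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return offset + ": " + key + " => " + value;
        }
    }

    /** Assumes log is ordered by offset. */
    static List<Entry> compact(List<Entry> log) {
        // Pass 1: record the latest offset at which each key occurs.
        Map<String, Long> latestOffset = new HashMap<>();
        for (Entry e : log) {
            latestOffset.put(e.key, e.offset);
        }
        // Pass 2: copy, in order, only the records that are the latest for their key.
        List<Entry> compacted = new ArrayList<>();
        for (Entry e : log) {
            if (latestOffset.get(e.key).longValue() == e.offset) {
                compacted.add(e);
            }
        }
        return compacted;
    }
}
```

Feeding in the three updates for user 123 at offsets 0, 2, and 3, plus one record for user 456 at offset 1, leaves two records. User 456 stays at offset 1, and only bill@gmail.com survives for user 123, still at offset 3.
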
Let's start by looking at a few use cases where this is useful, then we'll see how it can be used.

Database change subscription. It is often necessary to have a data set in multiple data systems, and often one of these systems is a database of some kind (either an RDBMS or perhaps a new-fangled key-value store). For example, you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database will need to be reflected in the cache, the search cluster, and eventually in Hadoop. If you are only handling the real-time updates, you only need the recent log. But if you want to be able to reload the cache or restore a failed search node, you may need a complete data set.
Event sourcing. This is a style of application design which co-locates query processing with application design and uses a log of changes as the primary store for the application.
Journaling for high-availability. A process that does local computation can be made fault-tolerant by logging out the changes it makes to its local state, so that another process can reload these changes and carry on if it fails. A concrete example of this is handling counts, aggregations, and other "group by"-like processing in a stream query system. Samza, a real-time stream-processing framework, uses this feature for exactly this purpose.

In each of these cases, one primarily needs to handle the real-time feed of changes (newly arriving messages). Occasionally, though, a machine crashes or data needs to be reloaded or reprocessed, and then one needs to do a full load. Log compaction allows both of these use cases to be fed off the same backing topic. This style of usage of a log is described in more detail in this blog post.

The general idea is quite simple. If we had infinite log retention, and we logged each change in the above cases, then we would have captured the state of the system at each time from when it first began. Using this complete log we could restore to any point in time by replaying the first N records in the log. This hypothetical complete log is not very practical for systems that update a single record many times as the log will grow without bound even for a stable dataset. The simple log retention mechanism which throws away old updates will bound space but the log is no longer a way to restore the current state—now restoring from the beginning of the log no longer recreates the current state as old updates may not be captured at all.

Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.

This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.

This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure: a database changelog caching service called Databus. Unlike most log-structured storage systems, Kafka is built for subscription and organizes data for fast linear reads and writes. Unlike Databus, Kafka acts as a source-of-truth store. Once a message has been written to Kafka, Kafka's copy is the authoritative one, and Kafka itself has no upstream copy to recover it from. This makes Kafka useful even in situations where the upstream data source would not otherwise be replayable.

Log Compaction Basics

Here is a high-level picture that shows the logical structure of a Kafka log with the offset for each message.

The head of the log is identical to a traditional Kafka log. It has dense, sequential offsets and retains all messages. Log compaction adds an option for handling the tail of the log. The picture above shows a log with a compacted tail. Note that the messages in the tail of the log retain the original offset assigned when they were first written—that never changes. Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the picture above the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.

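Offsets survive compaction unchanged, so a read at an offset that was compacted away resolves to the next retained offset. A `NavigableSet` ceiling lookup captures this. In this sketch `CompactedLog` is a hypothetical name, and the retained offsets are invented to match the 36/37/38 example.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// After compaction, a read at a removed offset starts at the next retained offset.
// CompactedLog is a hypothetical name; retained holds the offsets still present.
final class CompactedLog {
    private final NavigableSet<Long> retained;

    CompactedLog(NavigableSet<Long> retained) {
        this.retained = new TreeSet<>(retained);
    }

    /** First retained offset >= requested, or -1 if the read is past the end of the log. */
    long resolveReadStart(long requested) {
        Long next = retained.ceiling(requested);
        return next == null ? -1L : next;
    }
}
```

With offsets 36 and 37 compacted away and 38 retained, reads starting at 36, 37, or 38 all begin at 38.
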
Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the above diagram.

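Delete markers can be added to the same kind of toy model. A record with a null value removes earlier records for its key, and the marker itself is dropped once it is older than the delete retention period. This sketch is hypothetical (`TombstoneCompaction` is not a Kafka class). It measures a marker's age from a per-record timestamp, a simplification of how the broker actually tracks it. The `deleteRetentionMs` parameter stands in for delete.retention.ms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy compaction with delete markers (null values). Hypothetical names; ages are
// measured from a per-record timestamp, a simplification of the broker's bookkeeping.
final class TombstoneCompaction {
    private TombstoneCompaction() {}

    static final class Rec {
        final long offset;
        final String key;
        final String value;      // null marks a delete of this key
        final long timestampMs;

        Rec(long offset, String key, String value, long timestampMs) {
            this.offset = offset;
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Assumes log is ordered by offset. */
    static List<Rec> compact(List<Rec> log, long nowMs, long deleteRetentionMs) {
        Map<String, Long> latestOffset = new HashMap<>();
        for (Rec r : log) {
            latestOffset.put(r.key, r.offset);
        }
        List<Rec> out = new ArrayList<>();
        for (Rec r : log) {
            if (latestOffset.get(r.key).longValue() != r.offset) {
                continue; // superseded by a later record (or delete marker) for the same key
            }
            boolean expiredMarker = r.value == null && nowMs - r.timestampMs >= deleteRetentionMs;
            if (!expiredMarker) {
                out.add(r);
            }
        }
        return out;
    }
}
```

A marker that is still within the retention period stays in the log, so a consumer that reads it learns the key was deleted. After the period passes, both the marker and the key disappear entirely.
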
The compaction is done in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment looks something like this:

What Guarantees Does Log Compaction Provide?

Log compaction guarantees the following:

Any consumer that stays caught-up to within the head of the log will see every message that is written; these messages will have sequential offsets. The topic's min.compaction.lag.ms sets the minimum length of time that must pass after a message is written before it can be compacted. That is, it provides a lower bound on how long each message will remain in the (uncompacted) head.
Ordering of messages is always maintained. Compaction will never re-order messages, just remove some.
The offset for a message never changes. It is the permanent identifier for a position in the log.
Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. All delete markers for deleted records will be seen, provided the consumer reaches the head of the log in less time than the topic's delete.retention.ms setting (the default is 24 hours). This matters because delete marker removal happens concurrently with reads. We therefore must not remove any delete marker before the consumer has seen it.

Log Compaction Details

Log compaction is handled by the log cleaner, a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log. Each compactor thread works as follows:

It chooses the log that has the highest ratio of log head to log tail
It creates a succinct summary of the last offset for each key in the head of the log
It recopies the log from beginning to end, removing keys that have a later occurrence in the log. New, clean segments are swapped into the log immediately, so the additional disk space required is just one additional log segment (not a full copy of the log).
The summary of the log head is essentially just a space-compact hash table. It uses exactly 24 bytes per entry. As a result, with 8 GB of cleaner buffer, one cleaner iteration can clean around 366 GB of log head (assuming 1 KB messages).

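The 366 GB figure follows from the 24-byte entry size, assuming that 8 GB means 8 × 2^30 bytes and that "1k" means 1024-byte messages. It also assumes every message in the head has a distinct key, so each message needs its own entry:

```latex
\begin{aligned}
\text{entries} &= \frac{8 \times 2^{30}\ \text{B}}{24\ \text{B/entry}} \approx 3.58 \times 10^{8},\\
\text{head covered} &\approx 3.58 \times 10^{8} \times 1024\ \text{B} \approx 3.665 \times 10^{11}\ \text{B} \approx 366\ \text{GB}\ (\approx 341\ \text{GiB}).
\end{aligned}
```
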
Configuring the Log Cleaner

The log cleaner is enabled by default. This will start the pool of cleaner threads. To enable log cleaning on a particular topic you can add the log-specific property

log.cleanup.policy=compact

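This can be done either at topic creation time or using the alter topic command. Note that log.cleanup.policy is the broker-wide default; when the policy is set on an individual topic, the topic-level property name is cleanup.policy.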

The log cleaner can be configured to retain a minimum amount of the uncompacted "head" of the log. This is enabled by setting the compaction time lag.

log.cleaner.min.compaction.lag.ms

This can be used to prevent messages newer than a minimum message age from being subject to compaction. If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag.

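The eligibility rule can be sketched per segment. This is a simplification. The hypothetical `CompactionEligibility.eligible` treats a whole segment as eligible or not based on its newest record, whereas the lag is defined per message. `minCompactionLagMs` stands in for the lag setting above, and 0 models the "not set" case.

```java
// Simplified per-segment eligibility check for compaction.
// CompactionEligibility is hypothetical; minCompactionLagMs stands in for the lag setting.
final class CompactionEligibility {
    private CompactionEligibility() {}

    /**
     * @param newestTimestampMs timestamp of the newest record in the segment
     * @param active            true for the segment currently being written to
     */
    static boolean eligible(long newestTimestampMs, boolean active, long nowMs, long minCompactionLagMs) {
        if (active) {
            return false; // the active segment is never compacted, however old its records are
        }
        // Every record in the segment must be at least minCompactionLagMs old.
        return nowMs - newestTimestampMs >= minCompactionLagMs;
    }
}
```
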
Further cleaner configurations are described here.

Kafka Quotas

4.9 Quotas

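Starting with 0.9, a Kafka cluster can enforce quotas on produce and fetch requests. Quotas are byte-rate thresholds defined for each group of clients.

A Kafka cluster can enforce quotas on requests to control the broker resources used by clients. Two types of client quotas can be enforced for each group of clients sharing a quota:

Network bandwidth quotas define byte-rate thresholds (since 0.9)
Request rate quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)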

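Why Are Quotas Necessary?

Producers and consumers can produce or consume very large volumes of data, monopolizing broker resources and saturating the network. Quotas protect against this. They matter even more in large multi-tenant clusters, where a small number of badly behaved clients can degrade the experience of well-behaved ones. In fact, when Kafka is run as a service, quotas make it possible to enforce API limits according to an agreed-upon contract.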

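Client Groups

In a secure cluster, a Kafka client is identified by its user principal, which represents an authenticated user. In a cluster that supports unauthenticated clients, the user principal is a grouping of unauthenticated users chosen by the broker using a configurable PrincipalBuilder. Client-id is a logical grouping of clients with a meaningful name chosen by the client application. The tuple (user, client-id) defines a secure logical group of clients that share both user principal and client-id.

Quotas can be applied to (user, client-id), user, or client-id groups. For a given connection, the most specific quota matching the connection is applied. All connections of a quota group share the quota configured for the group. For example, suppose (user="test-user", client-id="test-client") has a produce quota of 10MB/sec. That quota is shared across all producer instances of user "test-user" with the client-id "test-client".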

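Quota Configuration

Quota configuration may be defined for (user, client-id), user, and client-id groups. You can override the default quota at any quota level that needs a higher (or lower) quota; the mechanism is similar to per-topic log configuration overrides. User and (user, client-id) quota overrides are written to ZooKeeper under /config/users, and client-id quota overrides are written under /config/clients. These overrides are read by all brokers and take effect immediately, so quotas can be changed without restarting the entire cluster. See here for more details. Default quotas for each group can also be updated dynamically using the same mechanism.

The order of precedence for quota configuration is: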

1. /config/users/<user>/clients/<client-id>
2. /config/users/<user>/clients/<default>
3. /config/users/<user>
4. /config/users/<default>/clients/<client-id>
5. /config/users/<default>/clients/<default>
6. /config/users/<default>
7. /config/clients/<client-id>
8. /config/clients/<default>

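The precedence list can be read as "try each path in order and use the first one that has an override." The resolver below is a sketch of that lookup. `QuotaResolver` is hypothetical. Its map from path to a single quota number also simplifies what ZooKeeper actually stores under those paths, which is a set of quota properties per entity.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of "most specific quota wins", mirroring the 8-entry precedence list.
// QuotaResolver is hypothetical; values are a single quota number per path.
final class QuotaResolver {
    static final String DEFAULT = "<default>";

    private QuotaResolver() {}

    /** Candidate paths, most specific first. */
    static List<String> candidatePaths(String user, String clientId) {
        return List.of(
            "/config/users/" + user + "/clients/" + clientId,
            "/config/users/" + user + "/clients/" + DEFAULT,
            "/config/users/" + user,
            "/config/users/" + DEFAULT + "/clients/" + clientId,
            "/config/users/" + DEFAULT + "/clients/" + DEFAULT,
            "/config/users/" + DEFAULT,
            "/config/clients/" + clientId,
            "/config/clients/" + DEFAULT);
    }

    /** Returns the quota at the first candidate path that has an override, if any. */
    static Optional<Long> resolve(Map<String, Long> overrides, String user, String clientId) {
        for (String path : candidatePaths(user, clientId)) {
            Long quota = overrides.get(path);
            if (quota != null) {
                return Optional.of(quota);
            }
        }
        return Optional.empty();
    }
}
```

Given overrides for (test-user, test-client), for test-user alone, and a client-id default, the exact (user, client-id) match wins for test-client. Another client of test-user falls back to the user-level quota, and an unrelated user ends up on the client-id default.
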
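Broker properties (quota.producer.default, quota.consumer.default) can also be used to set default network bandwidth quotas for client-id groups. These properties are deprecated and will be removed in a later release. Default quotas for client-id can be set in ZooKeeper, in the same way as other quota overrides and defaults.

Network Bandwidth Quotas

Network bandwidth quotas are defined as a byte-rate threshold for each group of clients sharing a quota. By default, each unique client group receives a fixed quota in bytes/sec, as configured by the cluster. This quota is defined on a per-broker basis: each group of clients can publish or fetch at most X bytes/sec per broker before its clients are throttled.

Request Rate Quotas

Request rate quotas are defined as the percentage of time a client can use the request-handler I/O threads and network threads of each broker within a quota window. A quota of n% represents n% of one thread, so the quota is out of a total capacity of ((num.io.threads + num.network.threads) * 100)%. For example, a broker with 8 I/O threads and 3 network threads has a total capacity of (8 + 3) * 100% = 1100%. Each group of clients may use up to n% in total across all I/O and network threads in a quota window before being throttled. The number of I/O and network threads is typically based on the number of cores available on the broker host. Request rate quotas therefore represent the total percentage of CPU that each group of clients sharing the quota may use.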

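Enforcement

By default, each unique client group receives a fixed quota as configured by the cluster. This quota is defined on a per-broker basis, and each client can use this quota on each broker before it is throttled. We decided that defining these quotas per broker is much better than having a fixed cluster-wide bandwidth per client. A cluster-wide quota would require a mechanism to share client quota usage among all the brokers, which could be harder to get right than the quota implementation itself!

How does a broker react when it detects a quota violation? In our solution, the broker does not return an error; instead, it attempts to slow down a client that exceeds its quota. It computes the amount of delay needed to bring the guilty client under its quota and delays the response for that time. This approach keeps the quota violation transparent to clients (apart from client-side metrics). It also spares them from having to implement any special backoff and retry behavior, which can get tricky. In fact, bad client behavior (retrying without backoff) can exacerbate the very problem quotas are trying to solve.

Byte rate and thread utilization are measured over multiple small windows (for example, 30 windows of 1 second each) in order to detect and correct quota violations quickly. Typically, large measurement windows (for example, 10 windows of 30 seconds each) lead to large bursts of traffic followed by long delays, which is not great in terms of user experience.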

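One way to compute the delay described above is to choose the delay that brings the client's average rate over the measurement window back down to its quota. Suppose a client sent B bytes during a window of W ms and its quota is T. Solving B / (W + delay) = T gives delay = W (O - T) / T, where O = B / W is the observed rate. The sketch below implements that formula as an assumption for illustration. It is not a transcription of the broker's code, and `ThrottleSketch` is a hypothetical name.

```java
// Sketch: delay that brings a client's windowed average rate back to its quota.
// ThrottleSketch is hypothetical; windowMs is the full measurement window
// (for example 30 samples of 1 s = 30_000 ms).
final class ThrottleSketch {
    private ThrottleSketch() {}

    /**
     * Solves bytesInWindow / (windowMs + delay) = quota for delay.
     * Returns 0 when the client is within its quota.
     */
    static long throttleMs(long bytesInWindow, long windowMs, double quotaBytesPerSec) {
        double quotaPerMs = quotaBytesPerSec / 1000.0;
        double observedPerMs = (double) bytesInWindow / windowMs;
        if (observedPerMs <= quotaPerMs) {
            return 0L; // within quota: respond without delay
        }
        return (long) Math.ceil((observedPerMs - quotaPerMs) / quotaPerMs * windowMs);
    }
}
```

Take a client with a 10 MB/s quota that actually sent 15 MB/s. A 30 s window (30 samples of 1 s) yields a 15 s delay, while a 300 s window (10 samples of 30 s) yields 150 s. This shows why the text prefers many small windows.
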
Kafka API Design

5.1 API Design

Producer API

The Producer API wraps the two low-level producers: kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer.

```
/* Outline of the producer's public methods (signatures only, bodies omitted) */
interface Producer<K, V> {
  /* Sends the data, partitioned by key, to the topic using either the */
  /* synchronous or the asynchronous producer */
  void send(kafka.javaapi.producer.ProducerData<K, V> producerData);

  /* Sends a list of data, partitioned by key, to the topic using either */
  /* the synchronous or the asynchronous producer */
  void send(java.util.List<kafka.javaapi.producer.ProducerData<K, V>> producerData);

  /* Closes the producer and cleans up */
  void close();
}
```

The goal is to expose all the producer functionality through a single API to the client. The new producer -

can handle queueing/buffering of multiple producer requests and asynchronous dispatch of the batched data -

kafka.producer.Producer provides the ability to batch multiple produce requests (producer.type=async) before serializing and dispatching them to the appropriate Kafka broker partition. The size of the batch can be controlled by a few config parameters. As events enter a queue, they are buffered until either queue.time or batch.size is reached. A background thread (kafka.producer.async.ProducerSendThread) dequeues the batch of data and lets the kafka.producer.EventHandler serialize and send the data to the appropriate Kafka broker partition. A custom event handler can be plugged in through the event.handler config parameter. At various stages of this producer queue pipeline, it is helpful to be able to inject callbacks, either for custom logging/tracing code or for custom monitoring logic. This is possible by implementing the kafka.producer.async.CallbackHandler interface and setting the callback.handler config parameter to that class.
handles the serialization of data through a user-specified Encoder-
```
interface Encoder<T> {
  public Message toMessage(T data);
}
```
The default is the no-op kafka.serializer.DefaultEncoder
provides software load balancing through an optionally user-specified Partitioner-

The routing decision is influenced by the kafka.producer.Partitioner.
```
interface Partitioner<T> {
   int partition(T key, int numPartitions);
}
```
The partition API uses the key and the number of available broker partitions to return a partition id. This id is used as an index into a sorted list of broker_ids and partitions to pick a broker partition for the producer request. The default partitioning strategy is hash(key) % numPartitions. If the key is null, a random broker partition is picked. A custom partitioning strategy can also be plugged in using the partitioner.class config parameter.
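A minimal sketch of the default strategy, written against the Partitioner interface shown above (redeclared so the block compiles on its own). `HashPartitioner` is a hypothetical name. One assumption to flag: Java's `hashCode()` can be negative, so the sketch uses `Math.floorMod` to keep the result in `[0, numPartitions)`; the text does not say how negative hashes are handled.

```java
import java.util.concurrent.ThreadLocalRandom;

// Same shape as the Partitioner interface above, redeclared so this sketch is self-contained.
interface Partitioner<T> {
    int partition(T key, int numPartitions);
}

// Hypothetical default strategy: hash(key) % numPartitions, or a random partition for a null key.
final class HashPartitioner<T> implements Partitioner<T> {
    @Override
    public int partition(T key, int numPartitions) {
        if (key == null) {
            // No key: pick any partition uniformly at random.
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the mapping depends only on the key, every message with the same key lands in the same partition as long as the partition count does not change.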

Consumer API

We have two levels of consumer APIs. The low-level "simple" API maintains a connection to a single broker and closely corresponds to the network requests sent to the server. This API is completely stateless: the offset is passed in on every request, so users can maintain this metadata however they choose.

The high-level API hides the details of brokers from the consumer and allows consuming from the cluster without concern for the underlying topology. It also maintains the state of what has been consumed. In addition, the high-level API can subscribe to topics that match a filter expression (either a whitelist or a blacklist regular expression).

Low-level API

```
class SimpleConsumer {

  /* Send fetch request to a broker and get back a set of messages. */
  public ByteBufferMessageSet fetch(FetchRequest request);

  /* Send a list of fetch requests to a broker and get back a response set. */
  public MultiFetchResponse multifetch(List<FetchRequest> fetches);

  /**
   * Get a list of valid offsets (up to maxNumOffsets) before the given time.
   * The result is a list of offsets, in descending order.
   * @param time: time in millisecs,
   *              if set to OffsetRequest$.MODULE$.LATIEST_TIME(), get from the latest offset available.
   *              if set to OffsetRequest$.MODULE$.EARLIEST_TIME(), get from the earliest offset available.
   */
  public long[] getOffsetsBefore(String topic, int partition, long time, int maxNumOffsets);
}
```

The low-level API is used to implement the high-level API, and is also used directly by some of our offline consumers (such as the Hadoop consumer) that have particular requirements around maintaining state.

High-level API

```
/* create a connection to the cluster */
ConsumerConnector connector = Consumer.create(consumerConfig);

interface ConsumerConnector {

  /**
   * This method is used to get a list of KafkaStreams, which are iterators over
   * MessageAndMetadata objects from which you can obtain messages and their
   * associated metadata (currently only topic).
   *  Input: a map of <topic, #streams>
   *  Output: a map of <topic, list of message streams>
   */
  public Map<String, List<KafkaStream>> createMessageStreams(Map<String, Integer> topicCountMap);

  /**
   * You can also obtain a list of KafkaStreams, that iterate over messages
   * from topics that match a TopicFilter. (A TopicFilter encapsulates a
   * whitelist or a blacklist which is a standard Java regex.)
   */
  public List<KafkaStream> createMessageStreamsByFilter(
      TopicFilter topicFilter, int numStreams);

  /* Commit the offsets of all messages consumed so far. */
  public void commitOffsets();

  /* Shut down the connector */
  public void shutdown();
}
```

This API is centered around iterators, implemented by the KafkaStream class. Each KafkaStream represents the stream of messages from one or more partitions on one or more servers. Each stream is used for single-threaded processing, so the client can specify the number of desired streams in the create call. A stream may therefore represent the merging of multiple server partitions (to match the number of processing threads), but each partition goes to only one stream.

The createMessageStreams call registers the consumer for the topic, which triggers a rebalance of the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing. The createMessageStreamsByFilter call additionally registers watchers to discover new topics that match its filter. Note that each stream returned by createMessageStreamsByFilter may iterate over messages from multiple topics (if the filter allows multiple topics).

Kafka network layer

The network layer is a fairly straightforward NIO server and will not be described in great detail. The sendfile implementation is done by giving the MessageSet interface a writeTo method. This allows a file-backed message set to use the more efficient transferTo implementation instead of an in-process buffered write. The threading model is a single acceptor thread and N processor threads, each handling a fixed number of connections. This design has been thoroughly tested elsewhere and found to be simple to implement and fast. The protocol is kept quite simple to allow for future implementations of clients in other languages.
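The transferTo path can be illustrated with the standard `FileChannel.transferTo` call. This is not Kafka's `writeTo` implementation; `ZeroCopySend` is a hypothetical helper that copies a byte range of a file into any `WritableByteChannel`. When the target is a socket channel, the JDK can hand the copy to the operating system (sendfile on Linux), which is the optimization the paragraph describes.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: sends bytes [position, position + count) of a file to a target channel.
final class ZeroCopySend {
    static long writeTo(Path file, long position, long count, WritableByteChannel target) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            // transferTo may move fewer bytes than requested, so keep going until done.
            while (sent < count) {
                long n = ch.transferTo(position + sent, count - sent, target);
                if (n <= 0) break; // nothing more to send (e.g. end of file reached)
                sent += n;
            }
            return sent;
        }
    }
}
```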

5.3 Message Format

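Messages (aka records) are always written in batches. The technical term for a batch of messages is a record batch, and a record batch contains one or more records. In the degenerate case, a record batch contains a single record. Record batches and records each have their own headers. The format of each is described below for Kafka 0.11.0 and later (message format v2, or magic=2). Details on message formats 0 and 1 are documented elsewhere.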

5.3.1 Record Batch

The following is the on-disk format of a RecordBatch.

baseOffset: int64
batchLength: int32
partitionLeaderEpoch: int32
magic: int8 (current magic value is 2)
crc: int32
attributes: int16
    bit 0~2:
        0: no compression
        1: gzip
        2: snappy
        3: lz4
    bit 3: timestampType
    bit 4: isTransactional (0 means not transactional)
    bit 5: isControlBatch (0 means not a control batch)
    bit 6~15: unused
lastOffsetDelta: int32
firstTimestamp: int64
maxTimestamp: int64
producerId: int64
producerEpoch: int16
baseSequence: int32
records: [Record]
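The `attributes` layout above can be decoded with plain masks and shifts. This hypothetical helper covers only the bits listed (compression codecs 0 to 3, timestamp type, transactional and control flags). In Kafka, timestamp type 0 means CreateTime and 1 means LogAppendTime.

```java
// Hypothetical helper that decodes the RecordBatch attributes field using the bit layout above.
final class BatchAttributes {
    private static final String[] CODECS = {"none", "gzip", "snappy", "lz4"};

    // bits 0~2: compression codec
    static String compression(short attributes) {
        int codec = attributes & 0x07;
        return codec < CODECS.length ? CODECS[codec] : "unknown(" + codec + ")";
    }

    // bit 3: timestamp type (0 = CreateTime, 1 = LogAppendTime)
    static int timestampType(short attributes) {
        return (attributes >> 3) & 0x01;
    }

    // bit 4: set if the batch belongs to a transaction
    static boolean isTransactional(short attributes) {
        return (attributes & 0x10) != 0;
    }

    // bit 5: set if the batch is a control batch
    static boolean isControlBatch(short attributes) {
        return (attributes & 0x20) != 0;
    }
}
```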

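Note that when compression is enabled, the compressed record data is serialized directly following the count of the number of records.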

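The CRC covers the data from the attributes to the end of the batch (i.e. all the bytes that follow the CRC). It is located after the magic byte, which means that clients must parse the magic byte before deciding how to interpret the bytes between the batch length and the magic byte. The partition leader epoch field is not included in the CRC computation, to avoid having to recompute the CRC when this field is assigned for every batch the broker receives. The CRC-32C (Castagnoli) polynomial is used for the computation.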
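The sketch below computes and checks the batch CRC with `java.util.zip.CRC32C` (Java 9 or later). It assumes the byte array holds exactly one v2 batch starting at index 0, with field offsets derived from the schema above (8 + 4 + 4 + 1 = 17 bytes precede `crc`). `BatchCrc` is a hypothetical helper, not a Kafka class. Because `partitionLeaderEpoch` (bytes 12 to 15) lies before the CRC field, rewriting it leaves the checksum valid.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

// Hypothetical helper for a v2 batch laid out as in the schema above, starting at index 0.
final class BatchCrc {
    // baseOffset(8) + batchLength(4) + partitionLeaderEpoch(4) + magic(1)
    static final int CRC_OFFSET = 17;
    static final int ATTRIBUTES_OFFSET = CRC_OFFSET + 4;

    // CRC-32C over every byte after the crc field (attributes to end of batch).
    static long compute(byte[] batch) {
        CRC32C crc = new CRC32C();
        crc.update(batch, ATTRIBUTES_OFFSET, batch.length - ATTRIBUTES_OFFSET);
        return crc.getValue();
    }

    static void writeCrc(byte[] batch) {
        ByteBuffer.wrap(batch).putInt(CRC_OFFSET, (int) compute(batch));
    }

    static boolean isValid(byte[] batch) {
        long stored = ByteBuffer.wrap(batch).getInt(CRC_OFFSET) & 0xFFFFFFFFL; // read as unsigned
        return stored == compute(batch);
    }
}
```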

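Compaction: unlike the older message formats, magic v2 and above preserves the first and last offset/sequence numbers from the original batch when the log is cleaned. This is required in order to restore the producer's state when the log is reloaded. If we did not retain the last sequence number, for example, then after a partition leader failure the producer might see an OutOfSequence error. The base sequence number must be preserved for duplicate checking (the broker checks incoming batches for duplicates by verifying that their first and last sequence numbers match the last one from that producer). As a result, the log may contain empty batches: when all the records in a batch are cleaned, the batch is still retained to preserve the producer's last sequence number. One oddity is that the firstTimestamp field is not preserved during compaction, so it will change if the first record in the batch is compacted away.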

5.3.1.1 Control Batches

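A control batch contains a single record called the control record. Control records are not passed on to applications; instead, consumers use them to filter out aborted transactional messages.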

The key of a control record conforms to the following schema:

version: int16 (current version is 0)
type: int16 (0 indicates an abort marker, 1 indicates a commit)

The schema for the value of a control record depends on the type. The value is opaque to clients.

5.3.2 Record

Record-level headers were introduced in Kafka 0.11.0. The on-disk format of a record with headers is shown below.

length: varint
attributes: int8
    bit 0~7: unused
timestampDelta: varint
offsetDelta: varint
keyLength: varint
key: byte[]
valueLen: varint
value: byte[]
Headers => [Header]

5.3.2.1 Record Header

headerKeyLength: varint
headerKey: String
headerValueLength: varint
Value: byte[]

We use the same varint encoding as Protobuf (see the Protobuf encoding documentation for details). The count of headers in a record is also encoded as a varint.
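For reference, a minimal sketch of the zigzag varint coding Kafka uses for these record fields (the same scheme as Protobuf's `sint32`): zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small negative numbers stay short, and the result is written 7 bits per byte with a continuation bit. `Varint` is a hypothetical helper that handles only 32-bit values and does not validate malformed input.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: zigzag + base-128 varint coding for 32-bit ints.
final class Varint {
    static byte[] encode(int value) {
        int v = (value << 1) ^ (value >> 31);      // zigzag
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);          // low 7 bits plus continuation bit
            v >>>= 7;
        }
        out.write(v);                              // last byte, continuation bit clear
        return out.toByteArray();
    }

    static int decode(byte[] bytes) {
        int v = 0;
        int shift = 0;
        for (byte b : bytes) {
            v |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (v >>> 1) ^ -(v & 1);               // undo zigzag
    }
}
```

For example, 300 zigzags to 600 and is written as the two bytes 0xD8 0x04.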

5.5 Log

A log for a topic named "my_topic" with two partitions consists of two directories (namely my_topic_0 and my_topic_1) populated with data files containing the messages for that topic. The log file format is a sequence of "log entries"; each log entry is a 4-byte integer N storing the message length, followed by the N message bytes. Each message is uniquely identified by a 64-bit integer offset giving the byte position of the start of this message in the stream of all messages ever sent to that topic on that partition. The on-disk format of each message is given below. Each log file is named with the offset of the first message it contains. So the first file created will be 00000000000.kafka, and each additional file will have an integer name roughly S bytes from the previous file, where S is the max log file size given in the configuration.

The exact binary format for messages is versioned and maintained as a standard interface, so message sets can be transferred between producer, broker, and client without recopying or conversion when desirable. The format is as follows:

On-disk format of a message

message length : 4 bytes (value: 1+4+n) 
"magic" value  : 1 byte
crc            : 4 bytes
payload        : n bytes
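A minimal sketch of encoding and decoding one entry in this layout. It assumes big-endian integers (Java's `ByteBuffer` default) and, following the Guarantees section later in this document, that the CRC32 is computed over the payload only. `LegacyMessage` is a hypothetical helper, not Kafka's own code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper for the entry layout: length(4) | magic(1) | crc(4) | payload(n), length = 1 + 4 + n.
final class LegacyMessage {
    static byte[] encode(byte magic, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + 4 + payload.length);
        buf.putInt(1 + 4 + payload.length);
        buf.put(magic);
        buf.putInt((int) crc.getValue());
        buf.put(payload);
        return buf.array();
    }

    // Returns the payload, or throws if the entry is truncated or its CRC does not match.
    static byte[] decode(ByteBuffer buf) {
        int length = buf.getInt();
        if (length < 5 || length > buf.remaining()) throw new IllegalStateException("truncated entry");
        buf.get(); // magic byte, not interpreted in this sketch
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;
        byte[] payload = new byte[length - 5];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if (crc.getValue() != storedCrc) throw new IllegalStateException("crc mismatch");
        return payload;
    }
}
```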

Using the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore, maintaining the mapping from a random id to an offset requires a heavyweight index structure that must be synchronized with disk, essentially a full persistent random-access data structure. Thus, to simplify the lookup structure, we decided to use a simple per-partition atomic counter, which can be combined with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. Once we settled on a counter, however, the jump to using the offset directly seemed natural: both are, after all, monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API, this decision is ultimately an implementation detail, and we went with the more efficient approach.

Writes

The log allows serial appends, which always go to the last file. This file is rolled over to a fresh file when it reaches a configurable size (say 1GB). The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced. This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
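The M/S rule can be expressed as a small policy object. This is a hypothetical sketch: `FlushPolicy` only decides when a flush is due, and it checks the time limit only when a message is appended; a real log would also need a timer so that an idle log is still flushed after S seconds.

```java
// Hypothetical flush policy: flush after M unflushed messages or once S seconds have
// passed since the last flush, whichever comes first. Times are supplied by the caller.
final class FlushPolicy {
    private final int maxMessages;     // M
    private final long maxIntervalMs;  // S, in milliseconds
    private int unflushed = 0;
    private long lastFlushMs;

    FlushPolicy(int maxMessages, long maxIntervalMs, long nowMs) {
        this.maxMessages = maxMessages;
        this.maxIntervalMs = maxIntervalMs;
        this.lastFlushMs = nowMs;
    }

    // Records one append; returns true if the caller should force the file to disk now.
    // The time limit is only evaluated here, on append.
    boolean onAppend(long nowMs) {
        unflushed++;
        if (unflushed >= maxMessages || nowMs - lastFlushMs >= maxIntervalMs) {
            unflushed = 0;
            lastFlushMs = nowMs;
            return true;
        }
        return false;
    }
}
```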

Reads

Reads are done by giving the 64-bit logical offset of a message and an S-byte max chunk size. This returns an iterator over the messages contained in the S-byte buffer. S is intended to be larger than any single message, but in the event of an abnormally large message, the read can be retried multiple times, doubling the buffer size each time, until the message is read successfully. A maximum message and buffer size can be specified to make the server reject messages larger than some size, and to give the client a bound on the maximum it ever needs to read to get a complete message. The read buffer will likely end with a partial message; this is easily detected by the size delimiting.

The actual process of reading from an offset requires first locating the log segment file in which the data is stored, calculating the file-specific offset from the global offset value, and then reading from that file offset. The search is done as a simple binary search variation against an in-memory range maintained for each file.
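Because each segment file is named by the first offset it contains, finding the file for a given offset is a floor search over the sorted base offsets. A minimal sketch with a hypothetical `SegmentLookup` helper, assuming the base offsets are already held in memory in ascending order; it does not check whether the offset lies past the end of the last segment.

```java
// Hypothetical helper: locate the segment holding an offset, then compute the in-file offset.
final class SegmentLookup {
    // Index of the segment with the greatest base offset <= offset, or -1 if none.
    static int findSegment(long[] sortedBaseOffsets, long offset) {
        int lo = 0;
        int hi = sortedBaseOffsets.length - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedBaseOffsets[mid] <= offset) {
                found = mid;     // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }

    // Global offset converted to a position relative to the start of its segment file.
    static long fileOffset(long[] sortedBaseOffsets, long offset) {
        int i = findSegment(sortedBaseOffsets, offset);
        if (i < 0) throw new IllegalArgumentException("offset out of range: " + offset);
        return offset - sortedBaseOffsets[i];
    }
}
```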

The log provides the capability of getting the most recently written message to allow clients to start subscribing as of "right now". This is also useful when a consumer fails to consume its data within its SLA-specified number of days. In this case, when the client attempts to consume a non-existent offset, it is given an OutOfRangeException and can either reset itself or fail, as appropriate to the use case.

The following is the format of the results sent to the consumer.

MessageSetSend (fetch result)

total length     : 4 bytes
error code       : 2 bytes
message 1        : x bytes
...
message n        : x bytes

MultiMessageSetSend (multiFetch result)

total length       : 4 bytes
error code         : 2 bytes
messageSetSend 1
...
messageSetSend n

Deletes

Data is deleted one log segment at a time. The log manager allows pluggable delete policies to choose which files are eligible for deletion. The current policy deletes any log with a modification time of more than N days ago, though a policy that retains the last N GB could also be useful. To avoid locking reads while still allowing deletes that modify the segment list, we use a copy-on-write segment list implementation that provides consistent views, so a binary search can proceed on an immutable static snapshot of the log segments while deletes are in progress.
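A minimal sketch of the copy-on-write idea using a hypothetical `CowSegmentList`: writers build a new immutable list under a lock and publish it through a `volatile` field, so a reader that took a snapshot keeps a consistent view for its whole binary search even while a deletion runs. The retention rule (delete segments last modified more than `maxAgeMs` ago) follows the current policy described above; times are passed in explicitly.

```java
import java.util.ArrayList;
import java.util.List;

final class CowSegmentList {
    // Hypothetical stand-in for a log segment file.
    static final class Segment {
        final long baseOffset;
        final long lastModifiedMs;

        Segment(long baseOffset, long lastModifiedMs) {
            this.baseOffset = baseOffset;
            this.lastModifiedMs = lastModifiedMs;
        }
    }

    // Always an immutable list; writers replace it rather than mutate it.
    private volatile List<Segment> segments = List.of();

    // Readers take this snapshot once and can search it without locking.
    List<Segment> snapshot() {
        return segments;
    }

    synchronized void append(Segment s) {
        List<Segment> next = new ArrayList<>(segments);
        next.add(s);
        segments = List.copyOf(next);
    }

    // Drops segments last modified more than maxAgeMs before nowMs; returns how many were removed.
    synchronized int deleteOlderThan(long maxAgeMs, long nowMs) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (nowMs - s.lastModifiedMs <= maxAgeMs) kept.add(s);
        }
        int removed = segments.size() - kept.size();
        segments = List.copyOf(kept);
        return removed;
    }
}
```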

Guarantees

The log provides a configuration parameter M, which controls the maximum number of messages written before forcing a flush to disk. On startup, a log recovery process runs that iterates over all messages in the newest log segment and verifies that each message entry is valid. A message entry is valid if the sum of its size and offset is less than the length of the file AND the CRC32 of the message payload matches the CRC stored with the message. If corruption is detected, the log is truncated to the last valid offset.

Note that two kinds of corruption must be handled: truncation, in which an unwritten block is lost due to a crash, and corruption, in which a nonsense block is ADDED to the file. The reason is that, in general, the OS makes no guarantee about the write order between the file inode and the actual block data, so in addition to losing written data, the file can gain nonsense data if the inode is updated with a new size but a crash occurs before the block containing that data is written. The CRC detects this corner case and prevents it from corrupting the log (though the unwritten messages are, of course, lost).
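The recovery scan can be sketched over an in-memory copy of the newest segment. `LogRecovery` is hypothetical and assumes the legacy entry layout from the log section (4-byte length, 1-byte magic, 4-byte CRC32 of the payload, then the payload). It returns the byte position of the first invalid entry, which is where the segment would be truncated; the `entry` method exists only to build test data.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class LogRecovery {
    // Number of leading bytes made of valid entries; the segment would be truncated to this length.
    static int validLength(byte[] segment) {
        ByteBuffer buf = ByteBuffer.wrap(segment);
        int pos = 0;
        while (pos + 4 <= segment.length) {
            int length = buf.getInt(pos);              // bytes that follow the length field
            long end = (long) pos + 4 + length;        // long arithmetic avoids int overflow
            // Size plus offset beyond the file: the tail was lost in a crash.
            if (length < 5 || end > segment.length) break;
            long storedCrc = buf.getInt(pos + 5) & 0xFFFFFFFFL;
            CRC32 crc = new CRC32();
            crc.update(segment, pos + 9, length - 5);  // CRC32 of the payload only
            // Mismatch: nonsense bytes were added to the file.
            if (crc.getValue() != storedCrc) break;
            pos = (int) end;
        }
        return pos;
    }

    // Test-data helper: builds one entry as length | magic | crc | payload.
    static byte[] entry(byte magic, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(9 + payload.length)
                .putInt(5 + payload.length)
                .put(magic)
                .putInt((int) crc.getValue())
                .put(payload)
                .array();
    }
}
```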

Distribution

Consumer Offset Tracking

The high-level consumer tracks the maximum offset it has consumed in each partition and periodically commits its offset vector so that it can resume from those offsets after a restart. Kafka provides the option to store all the offsets for a given consumer group in a designated broker (for that group) called the offset manager; that is, every consumer instance in that consumer group sends its offset commits and fetches to that offset manager (broker). The high-level consumer handles this automatically. If you use the simple consumer, you need to manage offsets manually. This is currently unsupported in the Java simple consumer, which can only commit or fetch offsets in ZooKeeper. If you use the Scala simple consumer, you can discover the offset manager and explicitly commit or fetch offsets to it. A consumer can look up its offset manager by issuing a ConsumerMetadataRequest to any Kafka broker and reading the ConsumerMetadataResponse, which contains the offset manager (later versions call these GroupCoordinatorRequest and GroupCoordinatorResponse). The consumer can then commit or fetch offsets from the offset manager broker. If the offset manager moves, the consumer needs to rediscover it. If you wish to manage your offsets manually, look at code samples that show how to issue OffsetCommitRequest and OffsetFetchRequest.

When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. The offset manager sends a successful offset commit response to the consumer only after all the replicas of the offsets topic have received the offsets. If the offsets fail to replicate within a configurable timeout, the offset commit fails and the consumer may retry the commit after backing off. (The high-level consumer does this automatically.) The brokers periodically compact the offsets topic, since it only needs to keep the most recent offset commit per partition. The offset manager also caches the offsets in an in-memory table in order to serve offset fetches quickly.
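Compacting the offsets topic means that, for each key, only the latest commit has to survive. The sketch below models that outcome with a hypothetical `OffsetCompaction` helper, assuming a simple string key such as "group/topic/partition"; it illustrates the result of compaction, not Kafka's log cleaner. The map it builds is also the kind of table the offset manager can keep in memory to answer fetches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OffsetCompaction {
    // One record in the offsets topic: a hypothetical "group/topic/partition" key and its committed offset.
    static final class Commit {
        final String key;
        final long offset;

        Commit(String key, long offset) {
            this.key = key;
            this.offset = offset;
        }
    }

    // Keeps only the latest commit per key, ordered by where those latest commits appear in the log.
    static List<Commit> compact(List<Commit> log) {
        Map<String, Commit> latest = new LinkedHashMap<>();
        for (Commit c : log) {
            latest.remove(c.key);   // re-insert so ordering follows the most recent write
            latest.put(c.key, c);
        }
        return new ArrayList<>(latest.values());
    }
}
```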

When the offset manager receives an offset fetch request, it simply returns the last committed offset vector from the offsets cache. If the offset manager was just started, or if it just became the offset manager for a new set of consumer groups (by becoming a leader for a partition of the offsets topic), it may need to load the offsets topic partition into the cache. In this case, the offset fetch fails with an OffsetsLoadInProgress exception and the consumer may retry the OffsetFetchRequest after backing off. (The high-level consumer does this automatically.)

Migrating offsets from ZooKeeper to Kafka

Kafka consumers in earlier releases store their offsets in ZooKeeper by default. You can migrate these consumers to commit offsets into Kafka by following these steps:

1. Set offsets.storage=kafka and dual.commit.enabled=true in your consumer config.
2. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
3. Set dual.commit.enabled=false in your consumer config.
4. Do a rolling bounce of your consumers and then verify that your consumers are healthy.

A roll-back (i.e., migrating from Kafka back to ZooKeeper) can also be performed using the above steps if you set offsets.storage=zookeeper.

ZooKeeper Directories

The following gives the ZooKeeper structures and algorithms used for coordination between consumers and brokers.

Notation

When an element in a path is denoted [xyz], the value of xyz is not fixed, and there is in fact a ZooKeeper znode for each possible value of xyz. For example, /topics/[topic] would be a directory named /topics containing a sub-directory for each topic name. Numerical ranges are also given, such as [0...5] to indicate the subdirectories 0, 1, 2, 3, 4. An arrow -> is used to indicate the contents of a znode. For example, /hello -> world would indicate a znode /hello containing the value "world".

Broker Node Registry

/brokers/ids/[0...N] --> host:port (ephemeral node)

This is a list of all present broker nodes, each of which provides a unique logical broker id that identifies it to consumers (and which must be given as part of its configuration). On startup, a broker node registers itself by creating a znode with the logical broker id under /brokers/ids. The purpose of the logical broker id is to allow a broker to be moved to a different physical machine without affecting consumers. An attempt to register a broker id that is already in use (say, because two servers are configured with the same broker id) is an error.

Since the broker registers itself in ZooKeeper using ephemeral znodes, this registration is dynamic and disappears if the broker is shut down or dies (thus notifying consumers that it is no longer available).

Broker Topic Registry

/brokers/topics/[topic]/[0...N] --> nPartitions (ephemeral node)

Each broker registers itself under the topics it maintains and stores the number of partitions for that topic.

Consumers and Consumer Groups

Consumers of topics also register themselves in ZooKeeper in order to coordinate with each other and balance the consumption of data. Consumers can also store their offsets in ZooKeeper by setting offsets.storage=zookeeper. However, this offset storage mechanism will be deprecated in a future release, so it is recommended to migrate offset storage to Kafka.

Multiple consumers can form a group and jointly consume a single topic. Each consumer in the same group is given a shared group_id. For example, if one consumer is your foobar process, which runs across three machines, you might assign this group of consumers the id "foobar". The group id is provided in the consumer's configuration and is how you tell the consumer which group it belongs to.

The consumers in a group divide up the partitions as fairly as possible, and each partition is consumed by exactly one consumer in a consumer group.

Consumer Id Registry

In addition to the group_id, which is shared by all consumers in a group, each consumer is given a transient, unique consumer_id (of the form hostname:uuid) for identification purposes. Consumer ids are registered in the following directory.

/consumers/[group_id]/ids/[consumer_id] --> {"topic1": #streams, ..., "topicN": #streams} (ephemeral node)

Each consumer in the group registers under its group and creates a znode with its consumer_id. The value of the znode contains a map of <topic, #streams>. This id is simply used to identify each of the consumers currently active within a group. This is an ephemeral node, so it disappears if the consumer process dies.

Consumer Offsets

Consumers track the maximum offset they have consumed in each partition. If offsets.storage=zookeeper, this value is stored in a ZooKeeper directory:

/consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value (persistent node)

Partition Owner registry

Each broker partition is consumed by a single consumer within a given consumer group. The consumer must establish its ownership of a given partition before any consumption can begin. To establish ownership, a consumer writes its own id in an ephemeral node under the particular broker partition it is claiming.

/consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id (ephemeral node)

Cluster Id

The cluster id is a unique and immutable identifier assigned to a Kafka cluster. The cluster id can have a maximum of 22 characters, and the allowed characters are defined by the regular expression [a-zA-Z0-9_\-]+, which corresponds to the characters used by the URL-safe Base64 variant with no padding. Conceptually, it is auto-generated when a cluster is started for the first time.

Implementation-wise, it is generated when a broker with version 0.10.1 or later is successfully started for the first time. The broker tries to get the cluster id from the /cluster/id znode during startup. If the znode does not exist, the broker generates a new cluster id and creates the znode with this cluster id.
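One way to produce an id that satisfies these rules is to encode a random 128-bit UUID with URL-safe Base64 and no padding: 16 bytes encode to exactly 22 characters from `[a-zA-Z0-9_-]`. `ClusterIds` is a hypothetical sketch; the text above states the constraints on the id, not how the broker generates it.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical generator/validator for ids that satisfy the cluster id rules above.
final class ClusterIds {
    static final Pattern VALID = Pattern.compile("[a-zA-Z0-9_\\-]+");

    // Random UUID -> 16 bytes -> URL-safe Base64 without padding (always 22 characters).
    static String generate() {
        UUID uuid = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buf.array());
    }

    // At most 22 characters, all from the allowed set.
    static boolean isValid(String id) {
        return id.length() <= 22 && VALID.matcher(id).matches();
    }
}
```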

Broker node registration

The broker nodes are basically independent, so they only publish information about what they have. When a broker joins, it registers itself under the broker node registry directory and writes information about its host name and port. The broker also registers the list of existing topics and their logical partitions in the broker topic registry. New topics are registered dynamically when they are created on the broker.

Consumer registration algorithm

When a consumer starts, it does the following:

Register itself in the consumer id registry under its group.
Register a watch on changes (new consumers joining or any existing consumers leaving) under the consumer id registry. (Each change triggers rebalancing among all consumers within the group to which the changed consumer belongs.)
Register a watch on changes (new brokers joining or any existing brokers leaving) under the broker id registry. (Each change triggers rebalancing among all consumers in all consumer groups.)
If the consumer creates a message stream using a topic filter, it also registers a watch on changes (new topics being added) under the broker topic registry. (Each change triggers re-evaluation of the available topics to determine which ones the topic filter allows. A newly allowed topic triggers rebalancing among all consumers within the consumer group.)
Force itself to rebalance within its consumer group.

Consumer rebalancing algorithm

The consumer rebalancing algorithm allows all the consumers in a group to reach consensus on which consumer is consuming which partitions. Consumer rebalancing is triggered on each addition or removal of broker nodes or of other consumers within the same group. For a given topic and a given consumer group, broker partitions are divided evenly among the consumers within the group. A partition is always consumed by a single consumer. This design simplifies the implementation: had we allowed a partition to be consumed concurrently by multiple consumers, there would be contention on the partition and some kind of locking would be required. If there are more consumers than partitions, some consumers won't get any data at all. During rebalancing, we try to assign partitions to consumers in a way that reduces the number of broker nodes each consumer has to connect to.

Each consumer does the following during rebalancing:

   1. For each topic T that Ci subscribes to 
   2.   let PT be all partitions producing topic T
   3.   let CG be all consumers in the same group as Ci that consume topic T
   4.   sort PT (so partitions on the same broker are clustered together)
   5.   sort CG
   6.   let i be the index position of Ci in CG and let N = size(PT)/size(CG)
   7.   assign partitions from i*N to (i+1)*N - 1 to consumer Ci
   8.   remove current entries owned by Ci from the partition owner registry
   9.   add newly assigned partitions to the partition owner registry  (we may need to re-try this until the original partition owner releases its ownership)
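Steps 4 to 7 can be sketched for a single topic as follows; steps 8 and 9 (updating the partition owner registry in ZooKeeper) are omitted. `RangeRebalance` is hypothetical and assumes partitions are identified by strings that sort with same-broker partitions adjacent (e.g. "b1-p0"). One deliberate deviation: taken literally, `N = size(PT)/size(CG)` leaves `size(PT) % size(CG)` partitions unassigned, so, like Kafka's range assignor, this sketch gives one extra partition to each of the first `size(PT) % size(CG)` consumers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RangeRebalance {
    // Partitions of one topic assigned to `consumer` (steps 4-7; ZooKeeper ownership steps omitted).
    static List<String> assign(String consumer, List<String> partitions, List<String> consumers) {
        List<String> pt = new ArrayList<>(partitions);
        List<String> cg = new ArrayList<>(consumers);
        Collections.sort(pt);                    // step 4: same-broker partitions become adjacent
        Collections.sort(cg);                    // step 5
        int i = cg.indexOf(consumer);            // step 6
        if (i < 0) throw new IllegalArgumentException("consumer not in group: " + consumer);
        int n = pt.size() / cg.size();
        int extra = pt.size() % cg.size();       // deviation: remainder goes to the first consumers
        int start = i * n + Math.min(i, extra);
        int count = n + (i < extra ? 1 : 0);
        return new ArrayList<>(pt.subList(start, start + count)); // step 7
    }
}
```

Because every consumer sorts the same inputs, each one computes its own share independently and the shares never overlap; with more consumers than partitions, the trailing consumers receive an empty list.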

When rebalancing is triggered at one consumer, rebalancing should be triggered in the other consumers within the same group at about the same time.

Adding and modifying topics

You have the option of either adding topics manually or having them created automatically when data is first published to a non-existent topic. If topics are auto-created, you may want to tune the default topic configurations used for auto-created topics.

Topics are added and modified using the topic tool:

 > bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y

The replication factor controls how many servers will replicate each message that is written. With a replication factor of 3, up to 2 servers can fail before you lose access to your data. We recommend a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.

The partition count controls how many logs the topic will be sharded into. The partition count has several impacts. First, each partition must fit entirely on a single server, so with 20 partitions the full data set (and the read and write load) will be handled by no more than 20 servers (not counting replicas). Second, the partition count limits the maximum parallelism of your consumers. This is discussed in greater detail in the concepts section.

The configurations added on the command line override the server's default settings for things such as how long data should be retained. The complete set of per-topic configurations is documented here.

Modifying and deleting topics

You can change the configuration or partitioning of a topic using the same topic tool.

To add partitions you can do:

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name  --partitions 40

Be aware that one use case for partitions is to semantically partition data. Adding partitions doesn't change the partitioning of existing data, so this may disturb consumers that rely on it. That is, if data is partitioned by hash(key) % number_of_partitions, adding partitions will potentially shuffle this partitioning. Kafka will not attempt to redistribute existing data automatically in any way.
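To see why adding partitions can disturb key-based consumers, the sketch below computes hash(key) % number_of_partitions before and after growing a topic from 20 to 40 partitions. It is illustrative only. zlib.crc32 stands in for the hash that Kafka's default partitioner actually uses (murmur2 over the serialized key), and the key names are made up.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """hash(key) % number_of_partitions, with CRC32 standing in for Kafka's murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys: List[str], old_count: int, new_count: int) -> List[str]:
    """Keys whose target partition changes when the partition count changes."""
    return [k for k in keys if partition_for(k, old_count) != partition_for(k, new_count)]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    moved = moved_keys(keys, 20, 40)
    print(f"{len(moved)} of {len(keys)} keys map to a different partition after 20 -> 40")
```

Because 40 is a multiple of 20, each key either stays in its old partition or moves to that partition number plus 20. Either way, new messages for a moved key land in a different partition from the one that holds that key's existing data.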

To add configs:

 bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y

To remove a config:

 bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --deleteConfig x

And finally, to delete a topic:

 bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name

Topic deletion is disabled by default in the versions described here (Kafka 1.0.0 and later enable it by default). To enable it, set the server config:

delete.topic.enable=true

Kafka does not currently support reducing the number of partitions for a topic or changing the replication factor.

To completely delete a topic's data, see this article.

Graceful shutdown

The Kafka cluster will automatically detect any broker shutdown or failure and elect new leaders for the partitions on that machine. This happens whether a server fails or is brought down intentionally for maintenance or configuration changes. For the latter cases, Kafka supports a more graceful mechanism for stopping a server than just killing it. When a server is stopped gracefully, it takes advantage of two optimizations:

It will sync all its logs to disk to avoid needing to do any log recovery when it restarts (i.e. validating the checksum for all messages in the tail of the log). Log recovery takes time, so this speeds up intentional restarts.
It will migrate any partitions the server is the leader for to other replicas prior to shutting down. This will make the leadership transfer faster and minimize the time each partition is unavailable to a few milliseconds.

Syncing the logs will happen automatically whenever the server is stopped other than by a hard kill, but the controlled leadership migration requires using a special setting:

    controlled.shutdown.enable=true

Note that controlled shutdown will only succeed if all the partitions hosted on the broker have replicas (i.e. the replication factor is greater than 1 and at least one of these replicas is alive). This is generally what you want, since shutting down the last replica would make that topic partition unavailable.

Balancing leadership

Whenever a broker stops or crashes, leadership for that broker's partitions transfers to other replicas. This means that by default, when the broker is restarted, it will only be a follower for all its partitions. It will therefore not be used for client reads and writes.

To avoid this imbalance, Kafka has a notion of preferred replicas. If the list of replicas for a partition is 1,5,9, then node 1 is preferred as the leader over either node 5 or 9, because it is earlier in the replica list. You can have the Kafka cluster try to restore leadership to the restored replicas by running the command:

 > bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
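The preferred-replica rule can be expressed in a few lines. This Python sketch is not the election tool's implementation; the PartitionState class and the (topic, partition) keys are hypothetical. It works as follows:

- The first broker in each replica list is treated as the preferred leader.
- Partitions whose current leader is not the preferred replica are reported as candidates.
- Partitions whose preferred replica is not in the in-sync replica set (ISR) are skipped, since leadership can only move to an in-sync replica.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


@dataclass
class PartitionState:
    replicas: List[int]  # assigned replica list, in order
    leader: int          # broker id of the current leader
    isr: List[int]       # in-sync replicas


def preferred_leader(state: PartitionState) -> int:
    """The preferred replica is the first broker in the assigned replica list."""
    return state.replicas[0]


def election_candidates(partitions: Dict[TopicPartition, PartitionState]):
    """(partition, current leader, preferred leader) for partitions that could be rebalanced."""
    result = []
    for tp, state in sorted(partitions.items()):
        pref = preferred_leader(state)
        if state.leader != pref and pref in state.isr:
            result.append((tp, state.leader, pref))
    return result
```

For example, after broker 1 restarts and catches up, a partition with replicas 1,5,9 that is still led by broker 5 is reported as a candidate to move back to broker 1.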

Since running this command can be tedious, you can also configure Kafka to do this automatically by setting the following configuration:

    auto.leader.rebalance.enable=true

Mirroring data between clusters

We refer to the process of replicating data between Kafka clusters as "mirroring", to avoid confusion with the replication that happens amongst the nodes in a single cluster. Kafka comes with a tool for mirroring data between Kafka clusters. The tool reads from one or more source clusters and writes to a destination cluster.

A common use case for this kind of mirroring is to provide a replica in another datacenter. This scenario will be discussed in more detail in the next section.

You can run many such mirroring processes to increase throughput and for fault tolerance. If one process dies, the others will take over the additional load.

Data will be read from topics in the source cluster and written to a topic with the same name in the destination cluster. In fact, the mirror maker is little more than a Kafka consumer and producer hooked together.

The source and destination clusters are completely independent entities: they can have different numbers of partitions, and the offsets will not be the same. For this reason, the mirror cluster is not really intended as a fault-tolerance mechanism, since the consumer position will be different. For fault tolerance, we recommend normal in-cluster replication. The mirror maker process will, however, retain and use the message key for partitioning, so order is preserved on a per-key basis.
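The per-key ordering guarantee follows from the mirror maker being a consumer and a producer hooked together that re-partitions by key. The Python sketch below models that with in-memory dictionaries rather than real clusters, under these assumptions:

- The helper names are hypothetical, and zlib.crc32 stands in for the real key hash.
- The source topic is itself partitioned by key.
- Source partitions are read sequentially. A real mirror maker consumes them concurrently, which still keeps each key's order, since one key lives in one source partition.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]          # (key, value)
Topic = Dict[int, List[Record]]   # partition id -> records in offset order


def partition_for(key: str, num_partitions: int) -> int:
    # CRC32 is a stand-in for the real key hash; any deterministic hash works here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(records: List[Record], num_partitions: int) -> Topic:
    """Append records to an in-memory topic, choosing the partition by key."""
    topic: Topic = defaultdict(list)
    for key, value in records:
        topic[partition_for(key, num_partitions)].append((key, value))
    return topic


def mirror(source: Topic, dest_partitions: int) -> Topic:
    """Consume each source partition in offset order and re-produce every record by key."""
    return produce([rec for p in sorted(source) for rec in source[p]], dest_partitions)


def per_key_order(topic: Topic) -> Dict[str, List[str]]:
    """Sequence of values seen for each key (each key lives in exactly one partition)."""
    order: Dict[str, List[str]] = defaultdict(list)
    for p in sorted(topic):
        for key, value in topic[p]:
            order[key].append(value)
    return dict(order)
```

Mirroring a 3-partition topic into a 2-partition one changes every offset, but per_key_order returns the same result for both sides.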

Here is an example showing how to mirror a single topic (named my-topic) from two input clusters:

 > bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config consumer-1.properties --consumer.config consumer-2.properties --producer.config producer.properties --whitelist my-topic

Note that we specify the list of topics with the --whitelist option. This option accepts any Java-style regular expression. So you could mirror two topics named A and B using --whitelist 'A|B', or you could mirror all topics using --whitelist '.*'. Make sure to quote any regular expression so the shell doesn't try to expand it as a file path. For convenience, we allow the use of ',' instead of '|' to specify a list of topics.
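The following Python sketch shows how a whitelist expression selects topics. It is an approximation, not MirrorMaker code:

- Python's re module stands in for Java regular expressions; they agree on simple patterns like these.
- ',' is rewritten to '|' as described above.
- A topic is assumed to be selected only when its whole name matches, as with Java's String.matches.

```python
import re
from typing import List


def compile_whitelist(raw: str) -> "re.Pattern[str]":
    """Treat ',' as a synonym for '|' and compile the result as a regular expression."""
    return re.compile(raw.strip().replace(",", "|"))


def mirrored_topics(raw: str, topics: List[str]) -> List[str]:
    """Topics whose entire name matches the whitelist expression."""
    pattern = compile_whitelist(raw)
    return [t for t in topics if pattern.fullmatch(t)]
```

A bare '*' is not a valid regular expression: re.compile('*') raises re.error. That is why '.*' is the pattern for mirroring every topic.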

Sometimes it is easier to say what you don't want. Instead of using --whitelist to say what you want to mirror, you can use --blacklist to say what to exclude. This option also takes a regular expression argument. However, --blacklist is currently not supported when using --new.consumer.

Combining mirroring with the configuration auto.create.topics.enable=true makes it possible to have a replica cluster that will automatically create and replicate all data in a source cluster, even as new topics are added.

Checking consumer position

Sometimes it's useful to see the position of your consumers. We have a tool that shows the position of all consumers in a consumer group, as well as how far behind the end of the log they are. Running this tool on a consumer group named my-group consuming a topic named my-topic would look like this:

> bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper localhost:2181 --group my-group

Group           Topic                          Pid Offset          logSize         Lag             Owner
my-group        my-topic                       0   0               0               0               test_jkreps-mn-1394154511599-60744496-0
my-group        my-topic                       1   0               0               0               test_jkreps-mn-1394154521217-1a0be913-0

NOTE: Since 0.9.0.0, the kafka.tools.ConsumerOffsetChecker tool has been deprecated. You should use kafka.admin.ConsumerGroupCommand (or the bin/kafka-consumer-groups.sh script) to manage consumer groups, including consumers created with the new consumer API.

## 0.9+
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group test-consumer-group

## 0.10+
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

Managing Consumer Groups

With the ConsumerGroupCommand tool, we can list, describe, or delete consumer groups. Note that deletion is only available when the group metadata is stored in ZooKeeper. When using the new consumer API (where the broker handles coordination of partition handling and rebalancing), the group is deleted when the last committed offset for that group expires. For example, to list all consumer groups across all topics:

> bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --list

 test-consumer-group

To view offsets as in the previous ConsumerOffsetChecker example, we "describe" the consumer group like this:

> bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group

GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
test-consumer-group            test-foo                       0          1               3               2               consumer-1_/127.0.0.1

There are a number of additional "describe" options that can be used to provide more detailed information about a consumer group:

--members: This option provides the list of all active members in the consumer group.

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group --members

CONSUMER-ID                                    HOST            CLIENT-ID       #PARTITIONS
consumer1-3fc8d6f1-581a-4472-bdf3-3515b4aee8c1 /127.0.0.1      consumer1       2
consumer4-117fe4d3-c6c1-4178-8ee9-eb4a3954bee0 /127.0.0.1      consumer4       1
consumer2-e76ea8c3-5d30-4299-9005-47eb41f3d3c4 /127.0.0.1      consumer2       3
consumer3-ecea43e4-1f01-479f-8349-f9130b75d8ee /127.0.0.1      consumer3       0

--members --verbose: On top of the information reported by the "--members" option above, this option also provides the partitions assigned to each member.

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group --members --verbose

CONSUMER-ID                                    HOST            CLIENT-ID       #PARTITIONS     ASSIGNMENT
consumer1-3fc8d6f1-581a-4472-bdf3-3515b4aee8c1 /127.0.0.1      consumer1       2               topic1(0), topic2(0)
consumer4-117fe4d3-c6c1-4178-8ee9-eb4a3954bee0 /127.0.0.1      consumer4       1               topic3(2)
consumer2-e76ea8c3-5d30-4299-9005-47eb41f3d3c4 /127.0.0.1      consumer2       3               topic2(1), topic3(0,1)
consumer3-ecea43e4-1f01-479f-8349-f9130b75d8ee /127.0.0.1      consumer3       0               -

--offsets: This is the default describe option and provides the same output as the "--describe" option.

--state: This option provides useful group-level information.

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group --state

COORDINATOR (ID)          ASSIGNMENT-STRATEGY       STATE                #MEMBERS
localhost:9092 (0)        range                     Stable               4

To manually delete one or multiple consumer groups, the "--delete" option can be used:

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --delete --group my-group --group my-other-group

 Deletion of requested consumer groups ('my-group', 'my-other-group') was successful.

To reset offsets of a consumer group, the "--reset-offsets" option can be used. This option supports one consumer group at a time. It requires one of the following scopes: --all-topics or --topic. A scope must be selected unless you use the '--from-file' scenario. Also, first make sure that the consumer instances are inactive. See KIP-122 for more details.

It has 3 execution options:

(default): display which offsets would be reset.
--execute: execute the --reset-offsets process.
--export: export the results in CSV format.

--reset-offsets also has the following scenarios to choose from (at least one scenario must be selected):

--to-datetime: Reset offsets to the offsets at the given datetime. Format: 'YYYY-MM-DDTHH:mm:SS.sss'
--to-earliest: Reset offsets to the earliest offset.
--to-latest: Reset offsets to the latest offset.
--shift-by: Reset offsets by shifting the current offset by 'n', where 'n' can be positive or negative.
--from-file: Reset offsets to the values defined in a CSV file.
--to-current: Reset offsets to the current offset.
--by-duration: Reset offsets to the offset at the given duration before the current timestamp. Format: 'PnDTnHnMnS'
--to-offset: Reset offsets to a specific offset.

Please note that out-of-range offsets will be adjusted to the available offset end. For example, if the offset end is at 10 and the offset shift request is for 15, offset 10 will actually be selected.
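The clamping rule can be sketched for a few of the simpler scenarios. This hypothetical Python helper is not part of kafka-consumer-groups.sh. The names log_start and log_end are ours, for the earliest and latest available offsets of a partition. --to-datetime, --by-duration and --from-file are omitted because they need a broker lookup or an input file.

```python
def clamp(target: int, log_start: int, log_end: int) -> int:
    """Adjust an out-of-range target to the nearest available offset."""
    return max(log_start, min(target, log_end))


def reset_offset(scenario: str, current: int, log_start: int, log_end: int, value: int = 0) -> int:
    """New committed offset for one partition under a subset of the --reset-offsets scenarios."""
    if scenario == "to-earliest":
        target = log_start
    elif scenario == "to-latest":
        target = log_end
    elif scenario == "to-current":
        target = current
    elif scenario == "shift-by":
        target = current + value          # value may be negative
    elif scenario == "to-offset":
        target = value
    else:
        raise ValueError(f"scenario not covered by this sketch: {scenario}")
    return clamp(target, log_start, log_end)
```

With an offset end of 10, requesting --to-offset 15 yields 10, matching the example above.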

For example, to reset offsets of a consumer group to the latest offset:

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --group consumergroup1 --topic topic1 --to-latest

 TOPIC                          PARTITION  NEW-OFFSET
topic1                         0          0

If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper), pass --zookeeper instead of --bootstrap-server:

> bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list

Expanding your cluster

Adding servers to a Kafka cluster is easy: just assign them a unique broker id and start up Kafka on your new servers. However, these new servers will not automatically be assigned any data partitions. Unless partitions are moved to them, they won't do any work until new topics are created. So when you add machines to your cluster, you will usually want to migrate some existing data to them.

The process of migrating data is manually initiated but fully automated. Under the covers, Kafka adds the new server as a follower of the partition it is migrating and lets it fully replicate the existing data in that partition. Once the new server has fully replicated the partition's contents and joined the in-sync replica set, one of the existing replicas deletes its copy of the partition's data.

The partition reassignment tool can be used to move partitions across brokers. An ideal partition distribution would ensure even data load and partition sizes across all brokers. In 0.8.1, the partition reassignment tool cannot automatically study the data distribution in a Kafka cluster and move partitions around to attain an even load distribution. As such, the admin has to figure out which topics or partitions should be moved.

The partition reassignment tool can run in 3 mutually exclusive modes:

--generate: In this mode, given a list of topics and a list of brokers, the tool generates a candidate reassignment (as JSON) to move all partitions of the specified topics to the new brokers. This option merely provides a convenient way to generate a partition reassignment plan from a list of topics and target brokers.
--execute: In this mode, the tool kicks off the reassignment of partitions based on the user-provided reassignment plan, passed with the --reassignment-json-file option. This can be either a custom plan hand-crafted by the admin or one produced by the --generate option.
--verify: In this mode, the tool verifies the status of the reassignment for all partitions listed during the last --execute. The status can be successfully completed, failed, or in progress.

Automatically migrating data to new machines

The partition reassignment tool can be used to move some topics off the current set of brokers to the newly added brokers. This is typically useful while expanding an existing cluster, since it is easier to move entire topics to the new set of brokers than to move one partition at a time. To do this, the user provides a list of topics to move and a target list of new brokers. The tool then evenly distributes all partitions of those topics across the new set of brokers. During this move, the replication factor of the topic is kept constant. In effect, the replicas of all partitions of the listed topics are moved from the old set of brokers to the newly added brokers.

For instance, the following example will move all partitions for topics foo1,foo2 to the new set of brokers 5,6. At the end of this move, all partitions for topics foo1 and foo2 will only exist on brokers 5,6.

Note: you must create all the JSON files below yourself; they are not created automatically. Copy the generated plan into a JSON file you create, then run the tool with it.

Since the tool accepts the input list of topics as a JSON file, you first need to identify the topics you want to move and create the JSON file as follows:

> cat topics-to-move.json
{"topics": [{"topic": "foo1"},
            {"topic": "foo2"}],
 "version":1
}

Once the JSON file is ready, use the partition reassignment tool to generate a candidate assignment:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "5,6" --generate 
Current partition replica assignment

{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[1,2]},
               {"topic":"foo1","partition":0,"replicas":[3,4]},
               {"topic":"foo2","partition":2,"replicas":[1,2]},
               {"topic":"foo2","partition":0,"replicas":[3,4]},
               {"topic":"foo1","partition":1,"replicas":[2,3]},
               {"topic":"foo2","partition":1,"replicas":[2,3]}]
}

Proposed partition reassignment configuration

{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[5,6]},
               {"topic":"foo1","partition":0,"replicas":[5,6]},
               {"topic":"foo2","partition":2,"replicas":[5,6]},
               {"topic":"foo2","partition":0,"replicas":[5,6]},
               {"topic":"foo1","partition":1,"replicas":[5,6]},
               {"topic":"foo2","partition":1,"replicas":[5,6]}]
}
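To see what --generate does conceptually, here is a simplified Python sketch that turns a current assignment into a proposed one. It relies on our own assumptions and is not Kafka's placement algorithm. Kafka also randomizes the starting broker and staggers follower placement, so its output can differ. The sketch works like this:

- Partitions are sorted by topic and partition number.
- Each partition keeps its replication factor.
- The starting broker rotates round-robin so that leaders are spread evenly.

The function name is hypothetical.

```python
import json
from typing import Dict, List


def propose_reassignment(current: Dict, brokers: List[int]) -> Dict:
    """Spread every listed partition over `brokers`, keeping each partition's replication factor."""
    entries = sorted(current["partitions"], key=lambda e: (e["topic"], e["partition"]))
    proposed = []
    for i, entry in enumerate(entries):
        rf = len(entry["replicas"])
        if rf > len(brokers):
            raise ValueError(f"{entry['topic']}-{entry['partition']} needs {rf} brokers, got {len(brokers)}")
        # Rotate the starting broker so that leaders (the first replica) are spread evenly.
        replicas = [brokers[(i + j) % len(brokers)] for j in range(rf)]
        proposed.append({"topic": entry["topic"], "partition": entry["partition"], "replicas": replicas})
    return {"version": 1, "partitions": proposed}


if __name__ == "__main__":
    current = {"version": 1, "partitions": [
        {"topic": "foo1", "partition": 0, "replicas": [3, 4]},
        {"topic": "foo1", "partition": 1, "replicas": [2, 3]},
        {"topic": "foo2", "partition": 0, "replicas": [3, 4]},
    ]}
    # Save this output as expand-cluster-reassignment.json before running --execute.
    print(json.dumps(propose_reassignment(current, [5, 6])))
```

The output has the same shape as the "Proposed partition reassignment configuration" above. Replica order within a partition may differ from the tool's own proposal.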

The tool generates a candidate assignment that will move all partitions of topics foo1,foo2 to brokers 5,6. Note, however, that at this point the partition movement has not started. The tool merely tells you the current assignment and the proposed new assignment. Save the current assignment in case you want to roll back to it. Save the new assignment in a JSON file (e.g. expand-cluster-reassignment.json) and pass it to the tool with the --execute option as follows:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --execute
Current partition replica assignment

{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[1,2]},
               {"topic":"foo1","partition":0,"replicas":[3,4]},
               {"topic":"foo2","partition":2,"replicas":[1,2]},
               {"topic":"foo2","partition":0,"replicas":[3,4]},
               {"topic":"foo1","partition":1,"replicas":[2,3]},
               {"topic":"foo2","partition":1,"replicas":[2,3]}]
}

Save this to use as the --reassignment-json-file option during rollback
Successfully started reassignment of partitions
{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[5,6]},
               {"topic":"foo1","partition":0,"replicas":[5,6]},
               {"topic":"foo2","partition":2,"replicas":[5,6]},
               {"topic":"foo2","partition":0,"replicas":[5,6]},
               {"topic":"foo1","partition":1,"replicas":[5,6]},
               {"topic":"foo2","partition":1,"replicas":[5,6]}]
}

Finally, the --verify option can be used with the tool to check the status of the partition reassignment. Note that the same expand-cluster-reassignment.json (used with the --execute option) should be used with the --verify option:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --verify
Status of partition reassignment:
Reassignment of partition [foo1,0] completed successfully
Reassignment of partition [foo1,1] is in progress
Reassignment of partition [foo1,2] is in progress
Reassignment of partition [foo2,0] completed successfully
Reassignment of partition [foo2,1] completed successfully 
Reassignment of partition [foo2,2] completed successfully

Custom partition assignment and migration

The partition reassignment tool can also be used to selectively move replicas of a partition to a specific set of brokers. Used this way, the tool assumes you already know the reassignment plan and do not need it to generate a candidate. You therefore skip the --generate step and move straight to the --execute step.

For instance, the following example moves partition 0 of topic foo1 to brokers 5,6 and partition 1 of topic foo2 to brokers 2,3.

The first step is to hand-craft the custom reassignment plan in a JSON file:

> cat custom-reassignment.json
{"version":1,"partitions":[{"topic":"foo1","partition":0,"replicas":[5,6]},{"topic":"foo2","partition":1,"replicas":[2,3]}]}

Then, use the JSON file with the --execute option to start the reassignment process:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file custom-reassignment.json --execute
Current partition replica assignment

{"version":1,
 "partitions":[{"topic":"foo1","partition":0,"replicas":[1,2]},
               {"topic":"foo2","partition":1,"replicas":[3,4]}]
}

Save this to use as the --reassignment-json-file option during rollback
Successfully started reassignment of partitions
{"version":1,
 "partitions":[{"topic":"foo1","partition":0,"replicas":[5,6]},
               {"topic":"foo2","partition":1,"replicas":[2,3]}]
}

The --verify option can be used with the tool to check the status of the partition reassignment. Note that the same custom-reassignment.json (used with the --execute option) should be used with the --verify option:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file custom-reassignment.json --verify
Status of partition reassignment:
Reassignment of partition [foo1,0] completed successfully
Reassignment of partition [foo2,1] completed successfully

Decommissioning brokers

The partition reassignment tool cannot yet automatically generate a reassignment plan for decommissioning brokers. As such, the admin has to come up with a plan that moves the replicas of all partitions hosted on the decommissioned broker to the rest of the brokers. This can be relatively tedious, because the reassignment must ensure that the replicas are not all moved to just one other broker. To make this process effortless, we plan to add tooling support for decommissioning brokers in the future.
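Until such tooling exists, a plan can be produced with a small script. The Python sketch below is one possible approach of our own, not a Kafka tool. For every partition hosted on the retiring broker, it substitutes the least-loaded remaining broker that does not already hold a replica of that partition. The moved replicas are therefore spread out rather than piled onto one broker. The output uses the same JSON layout as the reassignment files in this section and would be passed to --execute. The function name and the load heuristic (replica count per broker) are assumptions.

```python
from collections import Counter
from typing import Dict, List


def decommission_plan(current: Dict, retiring: int, remaining: List[int]) -> Dict:
    """Move every replica on `retiring` to the least-loaded remaining broker."""
    load = Counter({b: 0 for b in remaining})
    for entry in current["partitions"]:
        for b in entry["replicas"]:
            if b in load:
                load[b] += 1
    moved = []
    for entry in current["partitions"]:
        if retiring not in entry["replicas"]:
            continue  # only partitions touching the retiring broker need a new plan
        candidates = [b for b in remaining if b not in entry["replicas"]]
        if not candidates:
            raise ValueError(f"no spare broker for {entry['topic']}-{entry['partition']}")
        target = min(candidates, key=lambda b: (load[b], b))  # least loaded, ties by broker id
        load[target] += 1
        # Keep the replica's position so a retiring preferred leader is replaced in place.
        replicas = [target if b == retiring else b for b in entry["replicas"]]
        moved.append({"topic": entry["topic"], "partition": entry["partition"], "replicas": replicas})
    return {"version": 1, "partitions": moved}
```

Retiring broker 1 from a four-broker cluster, for example, sends its three replicas to three different brokers instead of one.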

Increasing replication factor

Increasing the replication factor of an existing partition is easy. Just specify the extra replicas in the custom reassignment JSON file and use it with the --execute option to increase the replication factor of the specified partitions.

For instance, the following example increases the replication factor of partition 0 of topic foo from 1 to 3. Before the increase, the partition's only replica is on broker 5. As part of the increase, we add replicas on brokers 6 and 7.

The first step is to hand-craft the custom reassignment plan in a JSON file:

> cat increase-replication-factor.json
{"version":1,
 "partitions":[{"topic":"foo","partition":0,"replicas":[5,6,7]}]}

Then, use the JSON file with the --execute option to start the reassignment process:

> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute
Current partition replica assignment

{"version":1,
 "partitions":[{"topic":"foo","partition":0,"replicas":[5]}]}

Save this to use as the --reassignment-json-file option during rollback
Successfully started reassignment of partitions
{"version":1,
 "partitions":[{"topic":"foo","partition":0,"replicas":[5,6,7]}]}

The --verify option can be used with the tool to check the status of the partition reassignment. Note that the same increase-replication-factor.json (used with the --execute option) should be used with the --verify option:

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --verify
Status of partition reassignment:
Reassignment of partition [foo,0] completed successfully

You can also verify the increase in replication factor with the kafka-topics tool:

> bin/kafka-topics.sh --zookeeper localhost:2181 --topic foo --describe
Topic:foo    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: foo    Partition: 0    Leader: 5    Replicas: 5,6,7    Isr: 5,6,7

Completely deleting a topic

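In Kafka 0.8.1.1 and earlier, no single command completely deletes a topic. The deletion commands I had seen online only remove the topic's registration from ZooKeeper, while the actual log data stays in Kafka's log directories. Because I needed this, I worked out a way to remove a topic completely. (PS: Kafka 0.8.2 appears to support direct deletion, but it is still in beta.)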

The environment is as follows:

Kafka directory: /usr/local/kafka_2.10-0.8.1.1
Log directory (log.dirs): /data1/kafka/log/
Topic to delete: zitest2

1. Delete the topic's information from ZooKeeper:

/usr/local/kafka_2.10-0.8.1.1/bin/kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper 10.12.0.91:2181,10.12.0.92:2181,10.12.0.93:2181/kafka --topic zitest2

On success it prints: deletion succeeded!

2. Use jps to find the QuorumPeerMain and Kafka processes, and kill them.

3. Delete the topic's files from the log.dirs directory. You will see several subdirectories named like zitest2-0, zitest2-1 … zitest2-n, one per partition of the topic:

rm -fr zitest2-0 … zitest2-n

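4. Edit the recovery-point-offset-checkpoint and replication-offset-checkpoint files in the log directory. Remove entries carefully, otherwise Kafka will fail to start afterwards.

The replication-offset-checkpoint file looks like this:

0
4 (total number of partitions)
zitest2 0 0
zitest2  3 0
hehe 0 0
hehe 1 0

After editing:

0
2 (total number of partitions)
hehe 0 0
hehe 1 0

Remove every zitest2 line and change the partition total to the number that remains after subtracting zitest2's partitions. Edit recovery-point-offset-checkpoint the same way.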
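Editing the checkpoint files by hand is error-prone, so here is a Python sketch of step 4. It comes with these assumptions and caveats:

- The file layout is as shown above: a version line, an entry-count line, then one "topic partition offset" entry per line.
- Run it only while Kafka is stopped, and back up the files first.
- Unlike a plain "delete every line containing zitest2", it compares the first field exactly, so a topic such as zitest2x would be left alone.

The function names are hypothetical.

```python
from pathlib import Path


def drop_topic_from_checkpoint(text: str, topic: str) -> str:
    """Remove all entries for `topic` and rewrite the entry count on line 2."""
    lines = text.strip().splitlines()
    version = lines[0].strip()
    entries = [line for line in lines[2:] if line.strip()]
    # Compare the topic field exactly instead of doing a substring match.
    kept = [line for line in entries if line.split()[0] != topic]
    return "\n".join([version, str(len(kept)), *kept]) + "\n"


def rewrite_checkpoint(path: Path, topic: str) -> None:
    """Apply drop_topic_from_checkpoint to a checkpoint file in place (Kafka must be stopped)."""
    path.write_text(drop_topic_from_checkpoint(path.read_text(), topic))
```

For the environment above, you would call rewrite_checkpoint(Path("/data1/kafka/log/replication-offset-checkpoint"), "zitest2"), and likewise for recovery-point-offset-checkpoint.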

Once this is done, ZooKeeper and Kafka can be started normally.

Limiting bandwidth usage during data migration

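Kafka lets you apply a throttle to replication traffic, setting an upper bound on the bandwidth used to move replicas from machine to machine. This is useful when rebalancing a cluster, bootstrapping a new broker, or adding or removing brokers, because it limits the impact these data-intensive operations have on users.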

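There are two interfaces that can be used to engage a throttle. The simplest and safest is to apply a throttle when invoking kafka-reassign-partitions.sh. Alternatively, kafka-configs.sh can be used to view and alter the throttle values directly.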

For example, if you execute a rebalance with the command below, it will move partitions at no more than 50 MB/s.

$ bin/kafka-reassign-partitions.sh --zookeeper myhost:2181 --execute --reassignment-json-file bigger-cluster.json --throttle 50000000

When you execute this script, you will see the throttle engage:

The throttle limit was set to 50000000 B/s
Successfully started reassignment of partitions.

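Should you wish to alter the throttle during a rebalance, for example to increase the throughput so it completes sooner, re-run the --execute command with the same reassignment-json-file: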

$ bin/kafka-reassign-partitions.sh --zookeeper localhost:2181  --execute --reassignment-json-file bigger-cluster.json --throttle 700000000

  There is an existing assignment running.
  The throttle limit was set to 700000000 B/s

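Once the rebalance completes, the administrator can check its status using the --verify option. If the rebalance has completed, the --verify command also removes the throttle. Administrators should run --verify promptly once rebalancing completes, because otherwise regular replication traffic could remain throttled.

When the --verify option is executed and the reassignment has completed, the script will confirm that the throttle was removed: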

$ bin/kafka-reassign-partitions.sh --zookeeper localhost:2181  --verify --reassignment-json-file bigger-cluster.json
  Status of partition reassignment:
  Reassignment of partition [my-topic,1] completed successfully
  Reassignment of partition [mytopic,0] completed successfully
  Throttle was removed.

管理员还可以使用kafka-configs.sh验证已分配的配置。有2对限制配置用于管理限流。而限制值本身，是个broker级别的配置，用于动态属性配置：

leader.replication.throttled.rate
  follower.replication.throttled.rate

Then there is the pair of enumerated sets of throttled replicas:

leader.replication.throttled.replicas
  follower.replication.throttled.replicas

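These are configured per topic. All four config values are assigned automatically by kafka-reassign-partitions.sh (discussed below).

To view the throttle limit configuration: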

$ bin/kafka-configs.sh --describe --zookeeper localhost:2181 --entity-type brokers
  Configs for brokers '2' are leader.replication.throttled.rate=700000000,follower.replication.throttled.rate=700000000
  Configs for brokers '1' are leader.replication.throttled.rate=700000000,follower.replication.throttled.rate=700000000

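This shows the throttle applied to both the leader and follower sides of the replication protocol. By default, both sides are assigned the same throttle value.

To view the list of throttled replicas: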

$ bin/kafka-configs.sh --describe --zookeeper localhost:2181 --entity-type topics
  Configs for topic 'my-topic' are leader.replication.throttled.replicas=1:102,0:101,
      follower.replication.throttled.replicas=1:101,0:102

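Here we see that the leader throttle is applied to partition 1 on broker 102 and partition 0 on broker 101. Likewise, the follower throttle is applied to partition 1 on broker 101 and partition 0 on broker 102.

By default, kafka-reassign-partitions.sh applies the leader throttle to all replicas that exist before the rebalance, any one of which might be the leader. It applies the follower throttle to all move destinations. So if a partition with replicas on brokers 101 and 102 is reassigned to 102 and 103, a leader throttle for that partition is applied to 101 and 102, and a follower throttle is applied to 103 only.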
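The default rule in the previous paragraph can be mirrored in a few lines to predict which entries will appear in leader.replication.throttled.replicas and follower.replication.throttled.replicas. This is a sketch of the rule as described in the prose, with a hypothetical function name, not the tool's own code; the output uses the "partition:broker" form shown by kafka-configs.sh --describe.

```bash
# Sketch: leader throttle on every replica that exists before the move,
# follower throttle only on destination brokers that did not already host it.
throttled_replicas() {
  local partition="$1" old="$2" new="$3"
  local leader=() follower=() b
  for b in ${old//,/ }; do
    leader+=("$partition:$b")
  done
  for b in ${new//,/ }; do
    case ",$old," in
      *",$b,"*) ;;                           # already a replica: no follower throttle
      *) follower+=("$partition:$b") ;;      # new destination
    esac
  done
  local IFS=,                                # join array elements with commas
  echo "leader=${leader[*]} follower=${follower[*]}"
}

throttled_replicas 0 101,102 102,103   # the 101,102 -> 102,103 example from the text
```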

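If required, you can also alter the throttle configurations manually with the --alter switch of kafka-configs.sh.

Safe usage of throttled replication

Take some care when using throttled replication. In particular:

(1) Throttle removal:

Remove the throttle promptly once reassignment completes (by running kafka-reassign-partitions.sh --verify).

(2) Ensuring progress:

If the throttle is set too low compared with the incoming write rate, replication may never make progress. This happens when:

max(BytesInPerSec) > throttle

where BytesInPerSec is the metric that tracks the write throughput of producers into each broker.

The administrator can monitor whether replication is making progress during the rebalance using the metric:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)

The lag should decrease steadily during replication. If it does not, increase the throttle throughput as described above.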
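Both checks in this list reduce to simple comparisons once the metric values are in hand. The sketch below takes BytesInPerSec samples and ConsumerLag samples as plain numbers; how they are collected (for example through a JMX tool) is left out, and the function names are hypothetical.

```bash
# Sketch: the two progress checks from the list above.
# throttle_too_low: $1 = throttle in B/s, remaining args = BytesInPerSec samples.
throttle_too_low() {
  local throttle="$1" max=0 v
  shift
  for v in "$@"; do
    if (( v > max )); then max=$v; fi
  done
  (( max > throttle ))          # max(BytesInPerSec) > throttle
}

# lag_shrinking: args = ConsumerLag samples, oldest first. Succeeds if the
# newest sample is below the oldest one, i.e. the lag is trending down.
lag_shrinking() {
  local first="$1" last="${!#}"
  (( last < first ))
}
```

If throttle_too_low succeeds or lag_shrinking fails, the text's remedy applies: re-run the execute command with a higher --throttle.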

Setting quotas

By default, clients receive an unlimited quota. Custom quotas can be set for each (user, client-id), user, or client-id group.

Configure a custom quota for (user=user1, client-id=clientA):

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type users --entity-name user1 --entity-type clients --entity-name clientA
Updated config for entity: user-principal 'user1', client-id 'clientA'.

Configure a custom quota for user=user1:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type users --entity-name user1
Updated config for entity: user-principal 'user1'.

Configure a custom quota for client-id=clientA:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type clients --entity-name clientA
Updated config for entity: client-id 'clientA'.

Default quotas can be set for a (user, client-id), user, or client-id group with the --entity-default option.
Configure the default client-id quota for user=user1:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type users --entity-name user1 --entity-type clients --entity-default
Updated config for entity: user-principal 'user1', default client-id.

Configure the default quota for users:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type users --entity-default
Updated config for entity: default user-principal.

Configure the default quota for client-ids:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type clients --entity-default
Updated config for entity: default client-id.
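The six commands above differ only in their entity arguments. The sketch below assembles the same kafka-configs.sh command line from a user and a client-id, printing it rather than running it. The convention that an empty argument omits that entity type and the literal word "default" maps to --entity-default is an assumption of this hypothetical helper (it would clash with a real user or client named "default").

```bash
# Sketch: build the kafka-configs.sh quota command for a (user, client-id) pair.
# Empty user/client = entity type omitted; "default" = --entity-default.
quota_cmd() {
  local user="$1" client="$2" producer_rate="$3" consumer_rate="$4"
  local args="bin/kafka-configs.sh --zookeeper localhost:2181 --alter"
  args+=" --add-config 'producer_byte_rate=$producer_rate,consumer_byte_rate=$consumer_rate'"
  if [ -n "$user" ]; then
    args+=" --entity-type users"
    if [ "$user" = default ]; then args+=" --entity-default"; else args+=" --entity-name $user"; fi
  fi
  if [ -n "$client" ]; then
    args+=" --entity-type clients"
    if [ "$client" = default ]; then args+=" --entity-default"; else args+=" --entity-name $client"; fi
  fi
  echo "$args"
}

quota_cmd user1 default 1024 2048   # default client-id quota for user1
```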

Describe the quota for a given (user, client-id):

> bin/kafka-configs.sh  --zookeeper localhost:2181 --describe --entity-type users --entity-name user1 --entity-type clients --entity-name clientA
Configs for user-principal 'user1', client-id 'clientA' are producer_byte_rate=1024,consumer_byte_rate=2048

Describe the quota for a given user:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --describe --entity-type users --entity-name user1
Configs for user-principal 'user1' are producer_byte_rate=1024,consumer_byte_rate=2048

Describe the quota for a given client-id:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --describe --entity-type clients --entity-name clientA
Configs for client-id 'clientA' are producer_byte_rate=1024,consumer_byte_rate=2048

If no entity name is given, all entities of the specified type are described. For example, describe all users:

> bin/kafka-configs.sh  --zookeeper localhost:2181 --describe --entity-type users
Configs for user-principal 'user1' are producer_byte_rate=1024,consumer_byte_rate=2048
Configs for default user-principal are producer_byte_rate=1024,consumer_byte_rate=2048

Likewise for (user, client):

> bin/kafka-configs.sh  --zookeeper localhost:2181 --describe --entity-type users --entity-type clients
Configs for user-principal 'user1', default client-id are producer_byte_rate=1024,consumer_byte_rate=2048
Configs for user-principal 'user1', client-id 'clientA' are producer_byte_rate=1024,consumer_byte_rate=2048

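Default quotas that apply to all client-ids can also be set through the following broker settings. These properties are applied only if no quota override or default is configured in ZooKeeper. By default, each client-id receives an unlimited quota. The following sets the default quota for each producer and consumer client-id to 10 MB/sec.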

quota.producer.default=10485760
quota.consumer.default=10485760

Note that these properties are deprecated and may be removed in a future release. Prefer kafka-configs.sh instead.
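The 10485760 used above is 10 × 1024 × 1024, so these quota settings count megabytes in binary units, unlike the replication throttle example, which wrote 50 MB/s as 50000000. A tiny sketch with a hypothetical helper that prints the two deprecated properties for any MiB/s value:

```bash
# Sketch: print the (deprecated) default quota properties for N MiB/s.
default_quota_props() {
  local bytes=$(( $1 * 1024 * 1024 ))
  printf 'quota.producer.default=%d\nquota.consumer.default=%d\n' "$bytes" "$bytes"
}

default_quota_props 10   # reproduces the 10485760 values shown above
```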

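Upgrading Kafka from older versions

Upgrading from 0.8.x, 0.9.x or 0.10.0.X to 0.10.1.0

0.10.1.0 has wire protocol changes. By following the recommended rolling upgrade plan below, you avoid downtime during the upgrade. However, review the potential breaking changes in 0.10.1.0 before upgrading.

Note: Because new protocols are introduced, upgrade your Kafka cluster before upgrading your clients (0.10.1.x clients only support 0.10.1.x or later brokers, while 0.10.1.x brokers also support older clients).

Rolling upgrade:

Update server.properties on all brokers and add the following properties:
- inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 0.8.2.0, 0.9.0.0 or 0.10.0.0).
- log.message.format.version=CURRENT_KAFKA_VERSION (see the section on the potential performance impact after upgrading for details on what this configuration does.)
Upgrade the brokers one at a time: shut the broker down, replace the code with the new version, and restart it.
Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.10.1.0.
If the previous message format was 0.10.0, change log.message.format.version to 0.10.1 (this is a no-op, because the message format is the same in 0.10.0 and 0.10.1). If the previous message format version was lower than 0.10.0, do not change log.message.format.version yet; change it only once all consumers have been upgraded to 0.10.0.0 or later.
Restart the brokers one by one for the new protocol version to take effect.
If log.message.format.version is still lower than 0.10.0 at this point, wait until all consumers have been upgraded to 0.10.0 or later, then change log.message.format.version to 0.10.1 on each broker and restart them one by one.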
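Each step above edits a key in server.properties. The sketch below sets a key to a value, replacing an existing line or appending a new one. It matches the key literally, so the dots in property names are not treated as regex wildcards. The function name and the config/server.properties path in the comments are placeholders.

```bash
# Sketch: set key=value in a .properties file (replace if present, else append).
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; found = 1; next }  # replace existing key
    { print }                                                 # keep other lines
    END { if (!found) print k "=" v }                         # append if missing
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Step 1, on every broker of a cluster currently running 0.10.0.0:
#   set_prop config/server.properties inter.broker.protocol.version 0.10.0.0
#   set_prop config/server.properties log.message.format.version 0.10.0
# After all brokers run 0.10.1.0:
#   set_prop config/server.properties inter.broker.protocol.version 0.10.1.0
```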

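Note: If you can accept downtime, you can simply take all the brokers down, update the code, and start them all again. They will start with the new protocol by default.

Note: Bumping the protocol version and restarting can be done any time after the brokers are upgraded; it does not have to happen immediately.

Potential breaking changes in 0.10.1.0

Log retention time is no longer based on the last modified time of the log segments. Instead, it is based on the largest timestamp of the messages in a log segment.
Log rolling time no longer depends on the log segment creation time. Instead, it is based on the timestamps in the messages. More specifically, if the timestamp of the first message in the segment is T, the log is rolled when a new message has a timestamp greater than or equal to T + log.roll.ms.
The number of open file handles increases by about 33% compared with 0.10.0, because a time index file is added for each segment.
The time index and the offset index share the same index size configuration. Because each time index entry is 1.5x the size of an offset index entry, users may need to increase log.index.size.max.bytes to avoid frequent log rolling.
Because of the increased number of index files, log loading during broker startup may take longer on brokers with a large number of log segments (e.g. >15K). Based on our experiments, setting num.recovery.threads.per.data.dir to one may reduce the log loading time.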
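Two of the rules above are arithmetic: the timestamp-based roll condition, and how many entries fit in an index file of a given size. The sketch assumes 8-byte offset-index entries and 12-byte time-index entries, which is where the 1.5x ratio comes from; 10485760 bytes is just an example size. Function names are hypothetical.

```bash
# Sketch: roll when a new message's timestamp >= T + log.roll.ms,
# where T is the timestamp of the segment's first message.
should_roll() {
  local first_ts="$1" new_ts="$2" roll_ms="$3"
  (( new_ts >= first_ts + roll_ms ))
}

# Entries that fit in one index file of max_bytes, assuming 8-byte offset
# index entries and 12-byte time index entries (12 / 8 = 1.5).
index_entries() {
  local max_bytes="$1"
  echo "offset=$(( max_bytes / 8 )) time=$(( max_bytes / 12 ))"
}

index_entries 10485760
```

With the same size limit, the time index fills about a third sooner, which is why the text suggests raising log.index.size.max.bytes.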

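Notable changes in 0.10.1.0

The new Java consumer is no longer in beta, and we recommend it for all new development. The old Scala consumers are still supported, but they will be deprecated in the next release and removed in a future major release.
The --new-consumer/--new.consumer switch is no longer required to use tools like MirrorMaker and the console consumer with the new consumer; you only need to pass a Kafka broker to connect to instead of ZooKeeper. In addition, using the console consumer with the old consumer is deprecated, and it will be removed in a future major release.
Kafka clusters can now be uniquely identified by a cluster ID, which is generated automatically when a broker is upgraded to 0.10.1.0. The cluster ID is available via the kafka.server:type=KafkaServer,name=ClusterId metric and is part of the Metadata response. Serializers, client interceptors and metric reporters can receive the cluster ID by implementing the ClusterResourceListener interface.
The BrokerState "RunningAsController" (value 4) has been removed. Because of a bug, a broker was only in this state briefly before transitioning out of it, so the impact of the removal should be minimal. The recommended way to detect whether a given broker is the controller is the kafka.controller:type=KafkaController,name=ActiveControllerCount metric.
The new Java consumer now allows users to search offsets by timestamp on partitions.
The new Java consumer now supports heartbeating from a background thread. A new configuration, max.poll.interval.ms, controls the maximum time between poll invocations before the consumer proactively leaves the group (5 minutes by default). The value of request.timeout.ms must always be larger than max.poll.interval.ms, because this is the maximum time a JoinGroup request can block on the server while the consumer is rebalancing; we have therefore changed its default to just above 5 minutes. Finally, the default value of session.timeout.ms has been lowered to 10 seconds, and the default value of max.poll.records has been changed to 500.
When using an Authorizer and a user does not have Describe authorization on a topic, the broker no longer returns TOPIC_AUTHORIZATION_FAILED errors to requests, since this leaks topic names. Instead, the UNKNOWN_TOPIC_OR_PARTITION error code is returned. This may cause unexpected timeouts or delays when using the producer and consumer, since Kafka clients typically retry automatically on unknown topic errors. If you suspect this is happening, consult the client logs.
Fetch responses now have a default size limit (50 MB for consumers and 10 MB for replication). The existing per-partition limits also apply (1 MB for consumers and replication). Note that neither of these limits is an absolute maximum, as explained in the next point.
Consumers and replicas can make progress if a message larger than the response/partition size limit is found. More concretely, if the first message in the first non-empty partition of the fetch is larger than the limit, the message is still returned.
kafka.api.FetchRequest and kafka.javaapi.FetchRequest now have overloaded constructors that let the caller specify the order of the partitions (since order is significant in v3). The previously existing constructors are deprecated, and partitions are shuffled before the request is sent to avoid starvation issues.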
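The constraint request.timeout.ms > max.poll.interval.ms is easy to check before deploying a consumer configuration. This sketch reads a properties file in plain key=value form (no spaces around "="). When a key is missing, it falls back to 300000 ms and 305000 ms, which are our understanding of the 0.10.1.0 defaults; verify them against the documentation for your exact version. Function names are hypothetical.

```bash
# Sketch: read key=value from a properties file, with a fallback default.
get_prop() {
  local file="$1" key="$2" default="$3" v
  v=$(awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file")
  echo "${v:-$default}"
}

# Check request.timeout.ms > max.poll.interval.ms for a consumer config file.
check_consumer_timeouts() {
  local file="$1" poll request
  poll=$(get_prop "$file" max.poll.interval.ms 300000)
  request=$(get_prop "$file" request.timeout.ms 305000)
  if (( request > poll )); then
    echo "ok: request.timeout.ms=$request > max.poll.interval.ms=$poll"
  else
    echo "invalid: request.timeout.ms=$request <= max.poll.interval.ms=$poll"
    return 1
  fi
}
```

Raising max.poll.interval.ms without also raising request.timeout.ms is the case this catches.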

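New protocol versions

ListOffsetRequest v1 supports accurate offset search based on timestamps.
MetadataResponse v2 introduces a new field: "cluster_id".
FetchRequest v3 supports limiting the response size (in addition to the existing per-partition limit).
JoinGroup v1 introduces a new field: "rebalance_timeout".

Upgrading from 0.8.x or 0.9.x to 0.10.0.0

0.10.0.0 has potential breaking changes (please review them before upgrading) and a possible performance impact after the upgrade. By following the recommended rolling upgrade plan below, you avoid downtime and performance impact during and after the upgrade.
Note: Because new protocols are introduced, upgrade your Kafka cluster before upgrading your clients.

Note for version 0.9.0.0: Because of a bug in 0.9.0.0, clients that depend on ZooKeeper (the old Scala high-level consumer, and MirrorMaker when used with it) will not work with 0.10.0.x brokers. Therefore, upgrade 0.9.0.0 clients to 0.9.0.1 before upgrading brokers to 0.10.0.x. This step is not necessary for 0.8.X or 0.9.0.1 clients.

Rolling upgrade:

Update server.properties on all brokers and add the following properties:
- inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 0.8.2 or 0.9.0.0).
- log.message.format.version=CURRENT_KAFKA_VERSION (see the section on the potential performance impact after upgrading for details on what this configuration does.)
Upgrade the brokers: shut each one down, upgrade it to the new version, and restart it.
Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.10.0.0. Note: do not change log.message.format.version yet; it should only change once all consumers have been upgraded to 0.10.0.0.
Restart the brokers one by one for the new protocol version to take effect.
Once all consumers have been upgraded to 0.10.0, change log.message.format.version to 0.10.0 on each broker and restart them one by one.

Note: If you can accept downtime, you can simply take all the brokers down, update the code, and start them all again. They will start with the new protocol by default.

Note: Bumping the protocol version and restarting can be done any time after the brokers are upgraded; it does not have to happen immediately.

Potential performance impact after upgrading to 0.10.0.0

The message format in 0.10.0 includes a new timestamp field and uses relative offsets for compressed messages. The on-disk message format is configured with log.message.format.version in server.properties, and the default on-disk format is 0.10.0. A consumer client older than 0.10.0.0 only understands message formats before 0.10.0. In this case, the broker converts messages from the 0.10.0 format to the earlier format before sending the response to the older consumer, and it can no longer use zero-copy transfer. Reports from the Kafka community show CPU utilization going from 20% before to 100% after an upgrade, which forced an immediate upgrade of all clients to bring performance back to normal. To avoid this conversion before consumers are upgraded to 0.10.0.0, set log.message.format.version to 0.8.2 or 0.9.0. The broker can then still use zero-copy transfer to send data to the old consumers. Once the consumers are upgraded, change the message format to 0.10.0 on the broker to benefit from the new format, which includes the new timestamp and improved compression. The conversion exists only to ensure compatibility; avoiding message conversion as much as possible is critical.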
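The decision in the previous paragraph depends only on the oldest consumer version. The sketch below returns the format the cluster already writes (passed in, for example 0.9.0) while any consumer is older than 0.10.0, and 0.10.0 otherwise. It relies on GNU sort -V for version comparison, and the consumer versions are inputs you would gather yourself. Function names are hypothetical.

```bash
# Sketch: true if $1 is a strictly older version string than $2 (GNU sort -V).
version_lt() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# $1 = format the cluster currently writes (e.g. 0.9.0), rest = consumer versions.
recommended_format() {
  local current_format="$1" v
  shift
  for v in "$@"; do
    if version_lt "$v" 0.10.0; then
      echo "$current_format"   # an old consumer remains: keep zero-copy, avoid conversion
      return
    fi
  done
  echo 0.10.0
}

recommended_format 0.9.0 0.10.0.1 0.9.0.1   # prints 0.9.0
```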

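Upgrading clients to 0.10.0.0 has no performance impact.

Note: By setting the message format version, you certify that all existing messages are on or below that message format version. Otherwise, consumers before 0.10.0.0 might break. In particular, after the message format is set to 0.10.0, do not change it back to an earlier format, as that may break consumers on versions before 0.10.0.0.

Note: Because of the additional timestamp in each message, producers sending small messages may see a drop in message throughput caused by the increased overhead. Likewise, replication now transmits an additional 8 bytes per message. If your cluster is running close to its network capacity, this may overwhelm the network cards and lead to failures and performance issues caused by the overload.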
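To judge how close the 8-byte timestamp brings you to network capacity, multiply it out. In this sketch, "copies" is the number of times each message crosses the network that you want to count (for example 2 follower fetches with replication factor 3). The inputs are illustrative, and the function name is hypothetical.

```bash
# Sketch: extra bytes per second from the 8-byte timestamp.
timestamp_overhead() {
  local msgs_per_sec="$1" copies="$2"
  echo $(( msgs_per_sec * 8 * copies ))
}

timestamp_overhead 1000000 2   # 16000000 B/s, i.e. about 16 MB/s
```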

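Note: If you have enabled compression on producers, you may notice reduced producer throughput or a lower compression ratio on the broker in some cases. When receiving compressed messages, 0.10.0 brokers avoid recompressing them, which generally reduces latency and improves throughput. In certain cases, however, this may reduce the batch size on the producer, which can lead to worse throughput. If this happens, tune linger.ms and batch.size on the producer for better throughput. In addition, the buffer the producer uses for compressing messages is smaller than the one the broker uses, which may have a negative impact on the compression ratio of messages on disk. We intend to make this configurable in a future Kafka release.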

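Potential breaking changes in 0.10.0.0

Starting from Kafka 0.10.0.0, the message format version in Kafka is represented as the Kafka version. For example, message format 0.9.0 refers to the highest message version supported by Kafka 0.9.0.
Message format 0.10.0 has been introduced and is used by default. It includes a timestamp field in the messages, and relative offsets are used for compressed messages.
ProduceRequest/Response v2 has been introduced and is used by default to support message format 0.10.0.
FetchRequest/Response v2 has been introduced and is used by default to support message format 0.10.0.
The MessageFormatter interface changed from def writeTo(key: Array[Byte], value: Array[Byte], output: PrintStream) to def writeTo(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]], output: PrintStream).
The MessageReader interface changed from def readMessage(): KeyedMessage[Array[Byte], Array[Byte]] to def readMessage(): ProducerRecord[Array[Byte], Array[Byte]].
MessageFormatter's package changed from kafka.tools to kafka.common.
MessageReader's package changed from kafka.tools to kafka.common.
MirrorMakerMessageHandler no longer exposes the handle(record: MessageAndMetadata[Array[Byte], Array[Byte]]) method, as it was never called.
The 0.7 KafkaMigrationTool is no longer packaged with Kafka. If you need to migrate from 0.7 to 0.10.0, migrate to 0.8 first and then follow the documented upgrade process from 0.8 to 0.10.0.
The new consumer API has been standardized to accept java.util.Collection as the sequence type for method parameters. Existing code may have to be updated to work with the 0.10.0 client library.
LZ4-compressed message handling was changed to use an interoperable framing specification (LZ4f v1.5.1). To maintain compatibility with old clients, this change only applies to message format 0.10.0 and later. Clients that produce or fetch LZ4-compressed messages using v0/v1 (message format 0.9.0) should continue to use the 0.9.0 framing implementation. Clients that use Produce/Fetch protocols v2 or later should use interoperable LZ4f framing. A list of interoperable LZ4 libraries is available at https://www.lz4.org/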

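Notable changes in 0.10.0.0

Starting from 0.10.0.0, a new client library named Kafka Streams is available for stream processing on data stored in Kafka topics. This new client library only works with 0.10.x and later brokers.
The default value of receive.buffer.bytes for the new consumer is now 64K.
The new consumer now exposes the configuration parameter exclude.internal.topics to prevent internal topics (e.g. the consumer offsets topic) from being accidentally included in regular-expression subscriptions. It is enabled by default.
The old Scala producer has been deprecated. Users should migrate to the latest Java client as soon as possible.

The new consumer API has been marked stable.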

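Upgrading from 0.8.0, 0.8.1.X or 0.8.2.X to 0.9.0.0

0.9.0.0 has potential breaking changes (please review them before upgrading) and an inter-broker protocol change from previous versions. This means the upgrade may not be compatible with older client versions. Therefore, upgrade your Kafka cluster before upgrading your clients. If you are using MirrorMaker, the downstream cluster should also be upgraded first.

Rolling upgrade

Update server.properties on all brokers and add the following property: inter.broker.protocol.version=0.8.2.X
Upgrade the brokers one at a time: shut the broker down, update the code, and restart it.
Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.9.0.0.
Restart the brokers one by one for the new protocol version to take effect.

Note: If you can accept downtime, you can simply take all the brokers down, update the code, and start them all again. They will start with the new protocol by default.

Note: Bumping the protocol version and restarting can be done any time after the brokers are upgraded; it does not have to happen immediately.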

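Potential breaking changes in 0.9.0.0

Java 1.6 is no longer supported.
Scala 2.9 is no longer supported.
Broker IDs above 1000 are now reserved by default for automatically assigned broker IDs. If your cluster has existing broker IDs above that threshold, increase the reserved.broker.max.id configuration accordingly.
The replica.lag.max.messages configuration has been removed. Partition leaders no longer consider the number of lagging messages when deciding which replicas are in sync.
The configuration parameter replica.lag.time.max.ms now refers not just to the time passed since the last fetch request from a replica, but also to the time since the replica last caught up. Replicas that are still fetching messages from the leader but have not caught up to the latest messages within replica.lag.time.max.ms are considered out of sync.
Compacted topics no longer accept messages without a key; the producer throws an exception if this is attempted. In 0.8.x, a message without a key would cause the log compaction thread to exit (and stop compacting all compacted topics).
MirrorMaker no longer supports multiple target clusters. It accepts only one --consumer.config. To mirror multiple source clusters, you need at least one MirrorMaker instance per source cluster, each with its own consumer configuration.
Tools packaged under org.apache.kafka.clients.tools.* have been moved to org.apache.kafka.tools.*. All included scripts still work as usual; only custom code that directly imports these classes is affected.
The default Kafka JVM performance options (KAFKA_JVM_PERFORMANCE_OPTS) have been changed in kafka-run-class.sh.
The kafka-topics.sh script (kafka.admin.TopicCommand) now exits with a non-zero exit code on failure.
The kafka-topics.sh script (kafka.admin.TopicCommand) now prints a warning when a topic name risks metric collisions due to the use of "." or "_" in the name, and an error in the case of an actual collision.
The kafka-console-producer.sh script (kafka.tools.ConsoleProducer) uses the new Java producer by default; users must specify "old-producer" to use the old producer.
By default, all command-line tools print all logging messages to stderr instead of stdout.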

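Notable changes in 0.9.0.1

The new broker ID generation feature can be disabled by setting broker.id.generation.enable to false.
The configuration parameter log.cleaner.enable is now true by default. This means topics with cleanup.policy=compact are now compacted by default, and 128 MB of heap is allocated to the cleaner process via log.cleaner.dedupe.buffer.size. You may want to review log.cleaner.dedupe.buffer.size and the other log.cleaner configuration values based on your usage of compacted topics.
The default value of the new consumer's fetch.min.bytes configuration is now 1.

Deprecations in 0.9.0.0

Altering topic configuration with the kafka-topics.sh script (kafka.admin.TopicCommand) has been deprecated. Use the kafka-configs.sh script (kafka.admin.ConfigCommand) instead.
kafka-consumer-offset-checker.sh (kafka.tools.ConsumerOffsetChecker) has been deprecated. Use kafka-consumer-groups.sh (kafka.admin.ConsumerGroupCommand) instead.
kafka.tools.ProducerPerformance has been deprecated. Use org.apache.kafka.tools.ProducerPerformance instead (kafka-producer-perf-test.sh will also use the new class).
The producer config block.on.buffer.full has been deprecated and will be removed in a future release. Its default value is now false. The KafkaProducer no longer throws BufferExhaustedException; instead it blocks for up to max.block.ms, after which it throws a TimeoutException. If block.on.buffer.full is explicitly set to true, max.block.ms is set to Long.MAX_VALUE and metadata.fetch.timeout.ms is not honored.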

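Upgrading from 0.8.1 to 0.8.2

0.8.2 is fully compatible with 0.8.1. Upgrade one broker at a time: shut it down, update the code, and restart it.

Upgrading from 0.8.0 to 0.8.1

0.8.1 is fully compatible with 0.8. Upgrade one broker at a time: shut it down, update the code, and restart it.

Upgrading from 0.7

Release 0.7 is incompatible with newer releases. Major changes were made to the API, ZooKeeper data structures, protocol, and configuration in order to add replication (which was missing in 0.7). Upgrading from 0.7 to later versions requires a special migration tool. This migration can be done without downtime.

Kafka datacenters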

Some deployments will need to manage a data pipeline that spans multiple datacenters. Our recommended approach is to deploy a local Kafka cluster in each datacenter, with application instances in each datacenter interacting only with their local cluster, and to mirror data between clusters (see the documentation on the MirrorMaker tool for how to do this).

This deployment pattern allows datacenters to act as independent entities and allows us to manage and tune inter-datacenter replication centrally. Each facility can stand alone and operate even if the inter-datacenter links are unavailable: when this occurs, the mirroring falls behind until the link is restored, at which time it catches up.

For applications that need a global view of all data, you can use mirroring to provide clusters that hold aggregate data mirrored from the local clusters in all datacenters. These aggregate clusters are used for reads by applications that require the full data set.

This is not the only possible deployment pattern. It is possible to read from or write to a remote Kafka cluster over the WAN, though obviously this adds whatever latency is required to reach the cluster.

Kafka naturally batches data in both the producer and consumer, so it can achieve high throughput even over a high-latency connection. To allow this, it may be necessary to increase the TCP socket buffer sizes for the producer, consumer, and broker using the socket.send.buffer.bytes and socket.receive.buffer.bytes configurations. The appropriate way to set this is documented here.
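A common rule of thumb for sizing these buffers (an assumption here, not stated above) is the bandwidth-delay product: bandwidth × round-trip time. The sketch computes it in bytes as a starting value for socket.send.buffer.bytes and socket.receive.buffer.bytes. On Linux, OS limits such as net.core.rmem_max and net.core.wmem_max may also need raising for a larger value to take effect. The function name is hypothetical.

```bash
# Sketch: bandwidth-delay product in bytes for a link of mbit_per_sec and rtt_ms.
bdp_bytes() {
  local mbit_per_sec="$1" rtt_ms="$2"
  # bits/s / 8 = bytes/s, times RTT in seconds
  echo $(( mbit_per_sec * 1000000 / 8 * rtt_ms / 1000 ))
}

bdp_bytes 1000 100   # 1 Gbit/s link, 100 ms RTT -> 12500000 bytes
```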

It is generally not advisable to run a single Kafka cluster that spans multiple datacenters over a high-latency link. This incurs very high replication latency for both Kafka writes and ZooKeeper writes, and neither Kafka nor ZooKeeper will remain available in all locations if the network between locations is unavailable.

Kafka hardware and OS

We are using dual quad-core Intel Xeon machines with 24GB of memory. You need sufficient memory to buffer active readers and writers. You can make a back-of-the-envelope estimate of memory needs by assuming you want to be able to buffer for 30 seconds and computing your memory need as write_throughput x 30.
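The estimate above as a one-line calculation; the throughput figure is illustrative, and the function name is hypothetical.

```bash
# Sketch: memory needed to buffer `seconds` of writes (default 30), in MB.
buffer_memory_mb() {
  local write_mb_per_sec="$1" seconds="${2:-30}"
  echo $(( write_mb_per_sec * seconds ))
}

buffer_memory_mb 200   # 200 MB/s of writes -> 6000 MB (about 6 GB)
```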

Disk throughput is important. We have 8x7200 rpm SATA drives. In general, disk throughput is the performance bottleneck, and more disks is better. Depending on how you configure flush behavior, you may or may not benefit from more expensive disks (if you force flush often, higher-RPM SAS drives may be better).

Kafka OS

Kafka should run well on any Unix system and has been tested on Linux and Solaris.

We have seen a few issues running on Windows, and Windows is not currently a well-supported platform, though we would be happy to change that.

You likely don't need to do much OS-level tuning, though there are a few things that will help performance.

Two configurations that may be important:

We raised the number of file descriptors, since we have lots of topics and lots of connections.
We raised the max socket buffer size to enable high-performance data transfer between data centers, as described here.
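How high to raise the file-descriptor limit depends on partition and segment counts. The formula in this sketch is an assumption drawn from general Kafka guidance rather than from this document: at least partitions × (partition_size / segment_size) descriptors for log segments, plus one per connection. The numbers in the usage line are illustrative; compare the result with `ulimit -n` for the broker's user. The function name is hypothetical.

```bash
# Sketch: rough minimum file descriptors for a broker.
min_fds() {
  local partitions="$1" partition_bytes="$2" segment_bytes="$3" connections="$4"
  local segments=$(( (partition_bytes + segment_bytes - 1) / segment_bytes ))  # round up
  echo $(( partitions * segments + connections ))
}

# 2000 partitions of 50 GiB in 1 GiB segments, 5000 connections:
need=$(min_fds 2000 $(( 50 * 1024 * 1024 * 1024 )) $(( 1024 * 1024 * 1024 )) 5000)
echo "need at least $need, current limit: $(ulimit -n)"
```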

Kafka disks and filesystems

We recommend using multiple drives to get good throughput, and not sharing the drives used for Kafka data with application logs or other OS filesystem activity, to ensure good latency. As of 0.8, you can either RAID these drives together into a single volume, or format and mount each drive as its own directory. Since Kafka has replication, the redundancy provided by RAID can also be provided at the application level. This choice has several tradeoffs.

If you configure multiple data directories, partitions are assigned round-robin to the data directories. Each partition lives entirely in one of the data directories. If data is not well balanced among partitions, this can lead to load imbalance between disks.
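Round-robin placement, as described above, means partition i lands in directory i mod N, and a partition never spans directories. This sketch only illustrates that rule; the directory names are placeholders, the function name is hypothetical, and newer Kafka releases may choose directories differently.

```bash
# Sketch: round-robin assignment of partitions to data directories.
assign_dirs() {
  local partitions="$1"
  shift
  local dirs=("$@") i
  for (( i = 0; i < partitions; i++ )); do
    echo "partition-$i -> ${dirs[i % ${#dirs[@]}]}"
  done
}

assign_dirs 5 /data1 /data2 /data3
```

Each directory gets a similar number of partitions, but if one partition carries most of the traffic, its disk still does most of the work. That is the imbalance the paragraph warns about.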

RAID can potentially do better at balancing load between disks (although it doesn't always seem to) because it balances load at a lower level. The primary downside of RAID is that it is usually a big performance hit for write throughput and reduces the available disk space.

Another potential benefit of RAID is the ability to tolerate disk failures. However, our experience has been that rebuilding the RAID array is so I/O intensive that it effectively disables the server, so this does not provide much real availability improvement.

Kafka application vs. OS flush management

Kafka always immediately writes all data to the filesystem and supports configuring the flush policy that controls when data is forced out of the OS cache and onto disk. This flush policy can force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
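The two triggers just described correspond to Kafka's log.flush.interval.messages and log.flush.interval.ms broker settings. This sketch only expresses the "whichever comes first" logic, with the thresholds passed in as arguments and a hypothetical function name; the next paragraphs explain why leaving application-level flushing disabled is usually preferred.

```bash
# Sketch: a forced flush is due when either threshold is reached.
# $3 plays the role of log.flush.interval.messages, $4 of log.flush.interval.ms.
flush_due() {
  local unflushed_msgs="$1" ms_since_flush="$2" interval_messages="$3" interval_ms="$4"
  (( unflushed_msgs >= interval_messages || ms_since_flush >= interval_ms ))
}
```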

Kafka must eventually call fsync to know that data was flushed. When recovering from a crash, for any log segment not known to be fsync'd, Kafka checks the integrity of each message by verifying its CRC, and rebuilds the accompanying offset index file, as part of the recovery process executed on startup.

Note that durability in Kafka does not require syncing data to disk, as a failed node will always recover from its replicas.

We recommend using the default flush settings, which disable application fsync entirely. This means relying on the background flush done by the OS and Kafka's own background flush. This provides the best of all worlds for most uses: no knobs to tune, great throughput and latency, and full recovery guarantees. We generally feel that the guarantees provided by replication are stronger than syncing to local disk; however, the paranoid may still prefer having both, and application-level fsync policies are still supported.

The drawback of application-level flush settings is that they are less efficient in their disk usage pattern (they give the OS less leeway to re-order writes), and they can introduce latency, because fsync in most Linux filesystems blocks writes to the file, whereas background flushing does much more granular page-level locking.

In general, you don't need to do any low-level tuning of the filesystem, but the next few sections go over some of it in case it is useful.

Understanding Linux OS flush behavior

In Linux, data written to the filesystem is kept in the pagecache until it must be written out to disk (because of an application-level fsync or the OS's own flush policy). The flushing is done by a set of background threads called pdflush (or, in kernels after 2.6.32, "flusher threads").

Pdflush has a configurable policy that controls how much dirty data can be kept in cache and for how long before it must be written back to disk. This policy is described here. When pdflush cannot keep up with the rate of data being written, it eventually causes the writing process to block, adding latency to the writes to slow down the accumulation of data.

You can see the current state of OS memory usage by running:

> cat /proc/meminfo

The meaning of these values is described in the link above.
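Of the many lines in /proc/meminfo, Dirty and Writeback show how much data is waiting to be flushed or is being written out. The sketch below extracts just those two values. It reads stdin so it can be fed /proc/meminfo on Linux or saved output elsewhere. On Linux, the flush policy itself is controlled by sysctls such as vm.dirty_background_ratio and vm.dirty_ratio. The function name is hypothetical.

```bash
# Sketch: print the Dirty and Writeback values (in kB) from meminfo-style text.
dirty_kb() {
  awk '$1 == "Dirty:"     { d = $2 }
       $1 == "Writeback:" { w = $2 }
       END { printf "dirty_kb=%d writeback_kb=%d\n", d, w }'
}

# On Linux:
#   dirty_kb < /proc/meminfo
```

A Dirty value that keeps climbing under steady load suggests the flusher threads are falling behind, which is when write stalls appear.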

Using the pagecache has several advantages over an in-process cache for storing data that will be written out to disk:

The I/O scheduler batches consecutive small writes into bigger physical writes, which improves throughput.
The I/O scheduler attempts to re-sequence writes to minimize movement of the disk head, which improves throughput.
It automatically uses all the free memory on the machine.

Kafka ext4 notes

Ext4 may or may not be the best filesystem for Kafka. Filesystems like XFS supposedly handle locking during fsync better. We have only tried Ext4, though.

EXT4 is a serviceable choice of filesystem for the Kafka data directories; however, getting the most performance out of it requires adjusting several mount options. In addition, these options are generally unsafe in a failure scenario and will result in much more data loss and corruption. For a single broker failure, this is not much of a concern, as the disk can be wiped and the replicas rebuilt from the cluster. In a multiple-failure scenario, such as a power outage, this can mean underlying filesystem (and therefore data) corruption that is not easily recoverable. The following options can be adjusted:

It is not necessary to tune these settings; however, those wanting to optimize performance have a few knobs that will help:

data=writeback: Ext4 defaults to data=ordered, which puts a strong order on some writes. Kafka does not require this ordering, as it does very paranoid data recovery on all unflushed logs. This setting removes the ordering constraint and seems to significantly reduce latency.
Disabling journaling: Journaling is a tradeoff: it makes reboots faster after server crashes, but it introduces a great deal of additional locking, which adds variance to write performance. Those who don't care about reboot time and want to reduce a major source of write latency spikes can turn off journaling entirely.
commit=num_secs: This tunes the frequency with which ext4 commits to its metadata journal. Setting it to a lower value reduces the loss of unflushed data during a crash. Setting it to a higher value improves throughput.
nobh: This setting controls additional ordering guarantees when using data=writeback mode. This should be safe with Kafka, as we do not depend on write ordering, and it improves throughput and latency.
delalloc: Delayed allocation means that the filesystem avoids allocating any blocks until the physical write occurs. This allows ext4 to allocate a large extent instead of smaller pages and helps ensure the data is written sequentially. This feature is great for throughput. It does seem to involve some locking in the filesystem, which adds a bit of latency variance.
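The mount options above end up in an /etc/fstab entry. This sketch prints such a line for a Kafka data disk. The device, mount point and commit interval are placeholders, and the function name is hypothetical. Disabling journaling is not a mount option: it is done separately, offline, with `tune2fs -O ^has_journal` on an unmounted filesystem. Some of these options (nobh in particular) may be ignored or rejected by newer kernels, so check your kernel's ext4 documentation. As explained above, these settings trade crash safety for speed.

```bash
# Sketch: build an /etc/fstab line with the ext4 options discussed above.
kafka_fstab_line() {
  local device="$1" mount_point="$2" commit_secs="${3:-60}"
  echo "$device $mount_point ext4 data=writeback,commit=$commit_secs,nobh,delalloc 0 2"
}

kafka_fstab_line /dev/sdb1 /var/kafka-data 60
```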

Kafka monitoring

6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats through pluggable stats reporters that hook up to your monitoring system.

All Kafka rate metrics have a corresponding cumulative count metric with the suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.

The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this lets you browse all metrics with JMX.

Security considerations for remote monitoring using JMX

Apache Kafka disables remote JMX by default. You can enable remote monitoring using JMX by setting the environment variable JMX_PORT for processes started using the CLI, or by setting standard Java system properties to enable remote JMX programmatically. You must enable security when enabling remote JMX in production scenarios, to ensure that unauthorized users cannot monitor or control your broker or application, or the platform on which these are running. Note that authentication is disabled for JMX by default in Kafka, and for production deployments the security configs must be overridden by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI, or by setting appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.
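A sketch of the environment a broker could be started with to get remote JMX with authentication. The com.sun.management.jmxremote.* names are standard JVM system properties, and setting KAFKA_JMX_OPTS replaces Kafka's default JMX options, which turn authentication and SSL off. The port, directory, file names and the choice to require SSL are placeholders. SSL also needs a keystore (for example via javax.net.ssl.keyStore), which is not shown. The function name is hypothetical.

```bash
# Sketch: print export lines for remote JMX with password/access files and SSL.
jmx_env() {
  local port="$1" conf_dir="$2"
  local opts="-Dcom.sun.management.jmxremote"
  opts+=" -Dcom.sun.management.jmxremote.authenticate=true"
  opts+=" -Dcom.sun.management.jmxremote.password.file=$conf_dir/jmxremote.password"
  opts+=" -Dcom.sun.management.jmxremote.access.file=$conf_dir/jmxremote.access"
  opts+=" -Dcom.sun.management.jmxremote.ssl=true"
  printf 'export JMX_PORT=%s\n' "$port"
  printf 'export KAFKA_JMX_OPTS="%s"\n' "$opts"
}

# Usage (placeholder port and directory), then point jconsole at host:9999:
#   eval "$(jmx_env 9999 /etc/kafka)"
#   bin/kafka-server-start.sh config/server.properties
```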

以下是指标介绍：

描述	MBEAN NAME	NORMAL VALUE
Message in rate 消息速率	kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Byte in rate from clients 客户端字节速率	kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Byte in rate from other 其他brokers字节速率	kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
Request rate 请求速率	kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce\|FetchConsumer\|FetchFollower}
Error rate 错误速率	kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)	Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.
Request size in bytes 请求大小（以字节为单位）	kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)	Size of requests for each request type.
Temporary memory size in bytes 临时内存大小（以字节为段位）	kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce\|Fetch}	Temporary memory used for message format conversions and decompression.
Message conversion time 消息转换时间	kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce\|Fetch}	Time in milliseconds spent on message format conversions.
Message conversion rate 消息转换比率	kafka.server:type=BrokerTopicMetrics,name={Produce\|Fetch}MessageConversionsPerSec,topic=([-.\w]+)	Number of records which required message format conversion.
Byte out rate to clients 向客户的字节输出率	kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Byte out rate to other brokers	kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
Message validation failure rate due to no key specified for compacted topic	kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec
Message validation failure rate due to invalid magic number	kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec
Message validation failure rate due to incorrect CRC checksum	kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec
Message validation failure rate due to non-continuous offset or sequence number in batch	kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec
Log flush rate and time	kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
# of under-replicated partitions (|ISR| < |all replicas|)	kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions	0
# of under-minIsr partitions (|ISR| < min.insync.replicas)	kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount	0
# of at-minIsr partitions (|ISR| = min.insync.replicas)	kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount	0
# of offline log directories	kafka.log:type=LogManager,name=OfflineLogDirectoryCount	0
Is controller active on broker	kafka.controller:type=KafkaController,name=ActiveControllerCount	only one broker in the cluster should have 1
Leader election rate	kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs	non-zero when there are broker failures
Unclean leader election rate	kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec	0
Pending topic deletes	kafka.controller:type=KafkaController,name=TopicsToDeleteCount
Pending replica deletes	kafka.controller:type=KafkaController,name=ReplicasToDeleteCount
Ineligible pending topic deletes	kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount
Ineligible pending replica deletes	kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount
Partition counts	kafka.server:type=ReplicaManager,name=PartitionCount	mostly even across brokers
Leader replica counts	kafka.server:type=ReplicaManager,name=LeaderCount	mostly even across brokers
ISR shrink rate	kafka.server:type=ReplicaManager,name=IsrShrinksPerSec	If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
ISR expansion rate	kafka.server:type=ReplicaManager,name=IsrExpandsPerSec	See above
Max lag in messages between follower and leader replicas	kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica	lag should be proportional to the maximum batch size of a produce request.
Lag in messages per follower replica	kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)	lag should be proportional to the maximum batch size of a produce request.
Requests waiting in the producer purgatory	kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce	non-zero if acks=-1 (acks=all) is used
Requests waiting in the fetch purgatory	kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch	size depends on fetch.max.wait.ms in the consumer
Request total time	kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}	broken into queue, local, remote and response send time
Time the request waits in the request queue	kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leader	kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the follower	kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}	non-zero for produce requests when acks=-1
Time the request waits in the response queue	kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the response	kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
Number of messages the consumer lags behind the producer by. Published by the consumer, not the broker.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max
The average fraction of time the network processors are idle	kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent	between 0 and 1, ideally > 0.3
The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication	kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count	ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination
The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication	kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount	ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker
The average fraction of time the request handler threads are idle	kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent	between 0 and 1, ideally > 0.3
Bandwidth quota metrics per (user, client-id), user or client-id	kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)	Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Request quota metrics per (user, client-id), user or client-id	kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)	Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Requests exempt from throttling	kafka.server:type=Request	exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.
ZooKeeper client request latency	kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs	Latency in milliseconds for ZooKeeper requests from broker.
ZooKeeper connection status	kafka.server:type=SessionExpireListener,name=SessionState	Connection status of broker's ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.
Max time to load group metadata	kafka.server:type=group-coordinator-metrics,name=partition-load-time-max	maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds
Avg time to load group metadata	kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg	average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds
Max time to load transaction metadata	kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max	maximum time, in milliseconds, it took to load transaction metadata from the transaction log (__transaction_state) partitions loaded in the last 30 seconds
Avg time to load transaction metadata	kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg	average time, in milliseconds, it took to load transaction metadata from the transaction log (__transaction_state) partitions loaded in the last 30 seconds
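As a rough illustration of how the expected values listed in the last column above can drive alerting, here is a minimal Python sketch. It assumes you already collect these gauges (for example through a JMX exporter, which this sketch does not cover) into a plain dict keyed by the MBean's name property; isr_gauges, broker_alerts and controller_alert are hypothetical helpers, not part of Kafka. The ISR rules mirror the definitions in the table and are counted on the broker that leads the partition.

```python
from typing import Dict, List


def isr_gauges(isr_size: int, replica_count: int, min_insync: int) -> List[str]:
    """Return which ReplicaManager partition gauges a led partition counts toward."""
    gauges = []
    if isr_size < replica_count:      # |ISR| < |all replicas|
        gauges.append("UnderReplicatedPartitions")
    if isr_size < min_insync:         # |ISR| < min.insync.replicas
        gauges.append("UnderMinIsrPartitionCount")
    elif isr_size == min_insync:      # |ISR| = min.insync.replicas
        gauges.append("AtMinIsrPartitionCount")
    return gauges


# Gauges whose normal value in the table is 0.
EXPECTED_ZERO = ("UnderReplicatedPartitions", "UnderMinIsrPartitionCount",
                 "OfflineLogDirectoryCount", "UncleanLeaderElectionsPerSec")
# Idle fractions in [0, 1]; the table suggests keeping them above 0.3.
IDLE_RATIOS = ("NetworkProcessorAvgIdlePercent", "RequestHandlerAvgIdlePercent")


def broker_alerts(broker_id: int, metrics: Dict[str, float]) -> List[str]:
    """Compare one broker's metric snapshot with the expected values."""
    alerts = []
    for name in EXPECTED_ZERO:
        if metrics.get(name, 0) > 0:
            alerts.append(f"broker {broker_id}: {name}={metrics[name]} (expected 0)")
    for name in IDLE_RATIOS:
        if name in metrics and metrics[name] < 0.3:
            alerts.append(f"broker {broker_id}: {name}={metrics[name]:.2f} (ideally > 0.3)")
    return alerts


def controller_alert(active_controller_counts: Dict[int, int]) -> List[str]:
    """Exactly one broker in the cluster should report ActiveControllerCount = 1."""
    total = sum(active_controller_counts.values())
    if total != 1:
        return [f"cluster: {total} active controllers (expected exactly 1)"]
    return []
```

For example, a partition with 3 replicas, 2 of them in sync and min.insync.replicas=2, is both under-replicated and at min ISR: writes with acks=all still succeed, but one more failure would make them fail.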

Common metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
connection-close-rate	Connections closed per second in the window.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
connection-creation-rate	New connections established per second in the window.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
network-io-rate	The average number of network operations (reads or writes) on all connections per second.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
outgoing-byte-rate	The average number of outgoing bytes sent per second to all servers.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-rate	The average number of requests sent per second.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-size-avg	The average size of all requests in the window.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-size-max	The maximum size of any request sent in the window.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
incoming-byte-rate	Bytes/second read off all sockets.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
response-rate	Responses received per second.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
select-rate	Number of times the I/O layer checked for new I/O to perform per second.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-wait-time-ns-avg	The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-wait-ratio	The fraction of time the I/O thread spent waiting.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-time-ns-avg	The average length of time for I/O per select call in nanoseconds.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-ratio	The fraction of time the I/O thread spent doing I/O.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
connection-count	The current number of active connections.	kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
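The MBean names in this table all follow one template. The hypothetical helper below fills it in for a single client and checks the client id against the ([-.\w]+) pattern shown in the table. It only builds the name string; it assumes you hand the result to whatever JMX tooling you already use.

```python
import re

# The client-id pattern used in the MBean names in the table.
CLIENT_ID_PATTERN = re.compile(r"[-.\w]+")
CLIENT_TYPES = ("producer", "consumer", "connect")


def client_metrics_mbean(client_type: str, client_id: str) -> str:
    """Build the MBean name that holds the common client metrics."""
    if client_type not in CLIENT_TYPES:
        raise ValueError(f"unknown client type: {client_type!r}")
    if not CLIENT_ID_PATTERN.fullmatch(client_id):
        raise ValueError(f"client id does not match [-.\\w]+: {client_id!r}")
    return f"kafka.{client_type}:type={client_type}-metrics,client-id={client_id}"


if __name__ == "__main__":
    # The name under which connection-count of this consumer would be found.
    print(client_metrics_mbean("consumer", "orders-app-1"))
```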

Common per-broker metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
outgoing-byte-rate	The average number of outgoing bytes sent per second for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-rate	The average number of requests sent per second for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-avg	The average size of all requests in the window for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-max	The maximum size of any request sent in the window for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
incoming-byte-rate	The average number of bytes received per second for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-latency-avg	The average request latency in ms for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-latency-max	The maximum request latency in ms for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
response-rate	Responses received per second for a node.	kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

Producer monitoring

The following metrics are available on producer instances.

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
waiting-threads	The number of user threads blocked waiting for buffer memory to enqueue their records.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
buffer-total-bytes	The maximum amount of buffer memory the client can use (whether or not it is currently used).	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
buffer-available-bytes	The total amount of buffer memory that is not being used (either unallocated or in the free list).	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
bufferpool-wait-time	The fraction of time an appender waits for space allocation.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-avg	The average number of bytes sent per partition per request.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-max	The max number of bytes sent per partition per request.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
compression-rate-avg	The average compression rate of record batches.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-queue-time-avg	The average time in ms record batches spent in the record accumulator.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-queue-time-max	The maximum time in ms record batches spent in the record accumulator.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg	The average request latency in ms.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-max	The maximum request latency in ms.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-send-rate	The average number of records sent per second.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
records-per-request-avg	The average number of records per request.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-retry-rate	The average per-second number of retried record sends.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-error-rate	The average per-second number of record sends that resulted in errors.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-size-max	The maximum record size.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-size-avg	The average record size.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
requests-in-flight	The current number of in-flight requests awaiting a response.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
metadata-age	The age in seconds of the current producer metadata being used.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-send-rate	The average number of records sent per second for a topic.	kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
byte-rate	The average number of bytes sent per second for a topic.	kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
compression-rate	The average compression rate of record batches for a topic.	kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-retry-rate	The average per-second number of retried record sends for a topic.	kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-error-rate	The average per-second number of record sends that resulted in errors for a topic.	kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
produce-throttle-time-max	The maximum time in ms a request was throttled by a broker.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
produce-throttle-time-avg	The average time in ms a request was throttled by a broker.	kafka.producer:type=producer-metrics,client-id=([-.\w]+)
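A minimal sketch of reading a few of these producer metrics together, assuming their values have already been fetched into a dict keyed by metric name (the collection step and the helper name producer_health are hypothetical). Buffer use is derived from buffer-total-bytes and buffer-available-bytes; a non-zero waiting-threads means application threads are blocked in send() waiting for buffer memory.

```python
from typing import Dict


def producer_health(m: Dict[str, float]) -> Dict[str, object]:
    """Summarize buffer pressure and send failures from producer-metrics values."""
    total = m["buffer-total-bytes"]
    available = m["buffer-available-bytes"]
    # Fraction of buffer memory currently in use (0.0 if the total is 0).
    used_fraction = (total - available) / total if total else 0.0
    return {
        "buffer_used_fraction": used_fraction,
        # User threads blocked waiting for buffer memory to enqueue records.
        "send_blocked": m.get("waiting-threads", 0) > 0,
        # Any retried or failed sends per second deserve a closer look.
        "retrying": m.get("record-retry-rate", 0.0) > 0,
        "failing": m.get("record-error-rate", 0.0) > 0,
    }
```

A buffer that stays near full while send_blocked is true usually means the producer generates data faster than the brokers accept it.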

New consumer monitoring

The following metrics are available on new consumer instances.

Consumer group metrics

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
commit-latency-avg	The average time taken for a commit request	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
commit-latency-max	The max time taken for a commit request	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
commit-rate	The number of commit calls per second	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
assigned-partitions	The number of partitions currently assigned to this consumer	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
heartbeat-response-time-max	The max time taken to receive a response to a heartbeat request	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
heartbeat-rate	The average number of heartbeats per second	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-time-avg	The average time taken for a group rejoin	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-time-max	The max time taken for a group rejoin	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-rate	The number of group joins per second	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-time-avg	The average time taken for a group sync	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-time-max	The max time taken for a group sync	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-rate	The number of group syncs per second	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
last-heartbeat-seconds-ago	The number of seconds since the last coordinator heartbeat	kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

Consumer fetch metrics

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
fetch-size-avg	The average number of bytes fetched per request	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-size-max	The maximum number of bytes fetched per request	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
bytes-consumed-rate	The average number of bytes consumed per second	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-per-request-avg	The average number of records in each request	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-consumed-rate	The average number of records consumed per second	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-latency-avg	The average time taken for a fetch request	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-latency-max	The max time taken for a fetch request	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-rate	The number of fetch requests per second	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-lag-max	The maximum lag in terms of number of records for any partition in this window	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-throttle-time-avg	The average throttle time in ms	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-throttle-time-max	The maximum throttle time in ms	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

Topic-level fetch metrics

METRIC/ATTRIBUTE NAME	DESCRIPTION	MBEAN NAME
fetch-size-avg	The average number of bytes fetched per request for a specific topic.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
fetch-size-max	The maximum number of bytes fetched per request for a specific topic.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
bytes-consumed-rate	The average number of bytes consumed per second for a specific topic.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
records-per-request-avg	The average number of records in each request for a specific topic.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
records-consumed-rate	The average number of records consumed per second for a specific topic.	kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

Others

We recommend monitoring GC time and other stats, as well as various server stats such as CPU utilization, I/O service time, etc. On the client side, we recommend monitoring the message/byte rate (global and per topic) and the request rate/size/time; on the consumer side, monitor the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, the max lag needs to be less than a threshold and the min fetch rate needs to be larger than 0.
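The keep-up rule above can be written as a one-line check. In this sketch, records_lag_max is the largest records-lag-max reported by the group's consumers, fetch_rates holds each consumer's fetch-rate, and lag_threshold is a number you choose for your workload; the function name consumer_keeping_up is hypothetical.

```python
from typing import Sequence


def consumer_keeping_up(records_lag_max: float, fetch_rates: Sequence[float],
                        lag_threshold: float) -> bool:
    """Max lag must stay under the threshold and the min fetch rate must be > 0."""
    if not fetch_rates:
        return False  # no consumer reported a fetch rate: treat as not keeping up
    return records_lag_max < lag_threshold and min(fetch_rates) > 0
```

A fetch rate of 0 on any single consumer fails the check even when the lag looks small, because a stalled consumer stops reporting new lag.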

Audit

The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics, we alert if a certain completeness is not achieved within a certain time period. The details are discussed in KAFKA-260.
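The actual audit design is described in KAFKA-260 and is not reproduced here. The sketch below only illustrates the general idea under our own assumptions: producers and each consumer group report how many messages they handled per time window, and a window is checked once a grace period has passed. The function audit_alerts, the window layout and the 0.999 target are all hypothetical.

```python
from typing import Dict, List, Tuple


def audit_alerts(produced: Dict[int, int],
                 consumed: Dict[str, Dict[int, int]],
                 now: int, grace_s: int,
                 target: float = 0.999) -> List[Tuple[str, int, float]]:
    """Return (consumer group, window start, completeness) for incomplete windows.

    produced: messages sent, keyed by window start (epoch seconds)
    consumed: for each consumer group, messages consumed per window start
    """
    alerts = []
    for window, sent in sorted(produced.items()):
        if now - window < grace_s:
            continue  # still inside the grace period; consumers may catch up
        for group, counts in sorted(consumed.items()):
            got = counts.get(window, 0)
            completeness = got / sent if sent else 1.0
            if completeness < target:
                alerts.append((group, window, completeness))
    return alerts
```

With produced = {0: 1000, 60: 1000, 120: 1000}, a billing group that consumed only 990 messages of window 60 is reported once now is at least 120, while window 120 is not checked yet at now=150 with a 60-second grace period.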

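KafkaOffsetMonitor: monitoring consumers and their lag

A small application for monitoring the progress of Kafka consumers and how far they lag behind.

KafkaOffsetMonitor monitors, in real time, the consumers in a Kafka cluster and their positions (offsets) in each partition.

You can view the current consumer groups and, for each topic, how every partition is being consumed. It shows at a glance whether the messages in each partition are being consumed promptly and how fast each partition is growing. This helps in debugging Kafka producers and consumers, so that you know exactly what is happening in your system.

The web UI keeps the history of partition offsets and consumer lag (how many days to keep is configurable at startup), so you can easily see how consumers have been doing over the past few days.

KafkaOffsetMonitor is written in Scala. Historical data is stored in a database file named offsetapp.db, a very lightweight SQLite file. Although the refresh frequency and the retention period can be set when KafkaOffsetMonitor is started, refreshing very often or keeping a large amount of data is not recommended: the charts can then render very slowly, or the application may even run out of memory.

Offsets, information about the Kafka cluster, and similar data are all read from ZooKeeper; the log size is computed.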

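Consumer group list

screenshot

Topics consumed by a group

screenshot

The fields in the screenshot mean:

topic: the name the topic was created with
partition: the partition number
offset: how many messages in this partition have been consumed (the group's committed offset)
logSize: how many messages have been written to this partition
Lag: how many messages have not been consumed yet (logSize - offset)
Owner: the consumer that owns this partition
Created: when this partition was created
Last Seen: when the consumption status was last refreshed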
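The Lag column is simply logSize minus offset for each partition. A minimal sketch, assuming both values are available as dicts keyed by partition number; the helper names are hypothetical, and treating a partition with no committed offset as offset 0 is our own simplification.

```python
from typing import Dict


def partition_lag(log_size: Dict[int, int], offset: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = logSize - offset."""
    # Assumption: a partition with no committed offset is treated as offset 0.
    return {p: size - offset.get(p, 0) for p, size in sorted(log_size.items())}


def total_lag(log_size: Dict[int, int], offset: Dict[int, int]) -> int:
    """Sum of the per-partition lags for one topic and consumer group."""
    return sum(partition_lag(log_size, offset).values())
```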

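Topic offset history

screenshot

Offset storage

Kafka is flexible about how offsets are managed: they can be kept in any storage and in any format. KafkaOffsetMonitor currently supports the following popular storage options:

Up to Kafka 0.8: offsets are stored in ZooKeeper by default (ZooKeeper-based)
Kafka 0.9 and later: offsets are stored by default in an internal topic (Kafka-based)
Storm Kafka Spout (ZooKeeper-based by default)

Each running instance of KafkaOffsetMonitor supports only one storage type.

Download

The KafkaOffsetMonitor source code can be downloaded from GitHub:

https://github.com/quantifind/KafkaOffsetMonitor

To build KafkaOffsetMonitor:

sbt/sbt assembly

However, I don't recommend building it yourself: the compiled jar pulls in CSS and JS from external sites hosted abroad, so the UI only works with Internet access, and you would have to change the JS paths before building. I have already done that, so you can simply download the result:

Baidu Cloud: https://pan.baidu.com/s/1kUZJrCV

Starting

After the build, a jar file named something like KafkaOffsetMonitor-assembly-0.3.0-SNAPSHOT.jar is generated in the KafkaOffsetMonitor root directory. It bundles all dependencies, so it can be started directly: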

java -cp KafkaOffsetMonitor-assembly-0.3.0-SNAPSHOT.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --offsetStorage kafka \
     --zk zk-server1,zk-server2 \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days

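Option 2: create a startup script. Since you may run more than one Kafka cluster, a script makes it easy to start several instances.

vim mobile_start_en.sh

# Contents of mobile_start_en.sh: start one KafkaOffsetMonitor instance in the background.
# -XX:PermSize/-XX:MaxPermSize were removed in Java 8; the Metaspace flags replace them.
mkdir -p mobile-logs
nohup java -Xms512M -Xmx512M -Xss1024K -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m \
     -cp KafkaOffsetMonitor-assembly-0.3.0-SNAPSHOT.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --offsetStorage kafka \
     --zk 127.0.0.1:2181 \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days 1>mobile-logs/stdout.log 2>mobile-logs/stderr.log &

Parameters:

offsetStorage: valid options are "zookeeper", "kafka" and "storm". From 0.9 on, offsets are stored in Kafka.
zk: the ZooKeeper address(es)
port: the port number
refresh: how often to refresh the data and write it to the DB
retain: how long to keep data in the DB
dbName: where to store the history (default 'offsetapp')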
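When one KafkaOffsetMonitor instance is started per cluster, it can help to generate the command lines instead of copying scripts. The hypothetical helper below only uses the flags listed above and returns an argument list; it does not launch anything.

```python
from typing import List, Optional, Sequence

VALID_STORAGE = ("zookeeper", "kafka", "storm")
MAIN_CLASS = "com.quantifind.kafka.offsetapp.OffsetGetterWeb"


def offset_monitor_cmd(jar: str, zk_hosts: Sequence[str], storage: str = "kafka",
                       port: int = 8080, refresh: str = "10.seconds",
                       retain: str = "2.days", db_name: Optional[str] = None) -> List[str]:
    """Build the argv for one KafkaOffsetMonitor instance."""
    if storage not in VALID_STORAGE:
        raise ValueError(f"offsetStorage must be one of {VALID_STORAGE}")
    cmd = ["java", "-cp", jar, MAIN_CLASS,
           "--offsetStorage", storage,
           "--zk", ",".join(zk_hosts),
           "--port", str(port),
           "--refresh", refresh,
           "--retain", retain]
    if db_name is not None:
        cmd += ["--dbName", db_name]
    return cmd
```

Each list can be passed to subprocess.Popen. Giving every instance its own port and, presumably, its own dbName keeps the web UIs and the SQLite files apart.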

Kafka Manager

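As a distributed publish-subscribe messaging system, Apache Kafka is used by many teams inside Yahoo. For example, the media analytics team uses it in its real-time analytics pipeline, and Yahoo's Kafka clusters handle peak bandwidth of more than 20 Gbps (compressed data). To make it easier for developers and service engineers to maintain Kafka clusters, Yahoo built a web-based management tool called Kafka Manager, which has recently been open-sourced on GitHub.

With Kafka Manager, users can more easily spot topics or partitions that are unevenly distributed across the cluster, manage multiple clusters, check cluster state, create topics, run preferred replica election, generate partition assignments based on the current state of the cluster, and reassign partitions based on the generated assignments. Kafka Manager is also a very good tool for getting a quick view of cluster state.

Kafka Manager is written in Scala and its web console is built on the Play Framework. In addition, Yahoo ported some Apache Kafka helper programs so that they work with the Apache Curator framework.

Kafka at Yahoo

Kafka is used by many teams at Yahoo. The media team uses it in a real-time analytics pipeline that handles peak bandwidth of up to 20 Gbps (compressed data).

To simplify the work of developers and service engineers who maintain Kafka clusters, Yahoo built a web-based tool called Kafka Manager. It makes it easy to see which topics, or which partitions, are unevenly distributed across the cluster. It supports managing multiple clusters, preferred replica election, replica reassignment and topic creation. It is also a great tool for quickly browsing a cluster.

The tool is written in Scala. As of 3 February 2015, Yahoo has open-sourced Kafka Manager. Its main features are:

Manage multiple clusters;
Easily inspect cluster state (topics, brokers, replica distribution, partition distribution);
Run preferred replica election;
Generate partition assignments based on the current state of the cluster;
Reassign partitions.

Screenshots of the tool: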

Cluster Management

screenshot

Topic List

screenshot

Topic View

screenshot

Consumer List View

screenshot

Consumed Topic View

screenshot

Broker List

screenshot

Broker View

screenshot

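Requirements

Kafka 0.8.x, 0.9.x, 0.10.x or 0.11.x
Java 8+
sbt 0.13.x

Configuration

At a minimum, the system needs the address of the ZooKeeper cluster, which is set in the application.conf file in the conf directory of the kafka-manager package. For example: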

kafka-manager.zkhosts="my.zookeeper.host.com:2181"

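You can specify multiple ZooKeeper hosts, separated by commas:

kafka-manager.zkhosts="my.zookeeper.host.com:2181,other.zookeeper.host.com:2181"

Alternatively, if you don't want to hard-code the addresses, you can use the ZK_HOSTS environment variable.

ZK_HOSTS="my.zookeeper.host.com:2181"

You can enable or disable the following features by editing application.conf:

application.features=["KMClusterManagerFeature","KMTopicManagerFeature","KMPreferredReplicaElectionFeature","KMReassignPartitionsFeature"]

KMClusterManagerFeature - allows adding, updating and deleting clusters from Kafka Manager
KMTopicManagerFeature - allows adding, updating and deleting topics on a Kafka cluster
KMPreferredReplicaElectionFeature - allows running preferred replica election for a Kafka cluster
KMReassignPartitionsFeature - allows generating partition assignments and reassigning partitions

Consider setting these parameters for larger clusters with JMX enabled: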

kafka-manager.broker-view-thread-pool-size=< 3 * number_of_brokers>
kafka-manager.broker-view-max-queue-size=< 3 * total # of partitions across all topics>
kafka-manager.broker-view-update-seconds=< kafka-manager.broker-view-max-queue-size / (10 * number_of_brokers) >

Here is an example for a Kafka cluster with 10 brokers and 100 topics, each with 10 partitions, giving 1000 partitions in total, with JMX enabled:

kafka-manager.broker-view-thread-pool-size=30
kafka-manager.broker-view-max-queue-size=3000
kafka-manager.broker-view-update-seconds=30
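The three sizing formulas can be checked against this example with a small helper. broker_view_settings is a hypothetical name; rounding the update interval down and flooring it at one second are our own assumptions for clusters where the formula does not divide evenly, not Kafka Manager behaviour.

```python
from typing import List


def broker_view_settings(num_brokers: int, total_partitions: int) -> List[str]:
    """Apply the three sizing formulas and render them as config lines."""
    pool_size = 3 * num_brokers
    queue_size = 3 * total_partitions
    # Rounded down, with a floor of 1 second (our assumption for small clusters).
    update_seconds = max(1, queue_size // (10 * num_brokers))
    return [
        f"kafka-manager.broker-view-thread-pool-size={pool_size}",
        f"kafka-manager.broker-view-max-queue-size={queue_size}",
        f"kafka-manager.broker-view-update-seconds={update_seconds}",
    ]
```

For 10 brokers and 1000 partitions this gives 3 * 10 = 30 threads, 3 * 1000 = 3000 queue slots and 3000 / (10 * 10) = 30 seconds, matching the values above.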

The following settings control the thread pool and queue of the consumer offset cache:

kafka-manager.offset-cache-thread-pool-size=< default is # of processors>
kafka-manager.offset-cache-max-queue-size=< default is 1000>
kafka-manager.kafka-admin-client-thread-pool-size=< default is # of processors>
kafka-manager.kafka-admin-client-max-queue-size=< default is 1000>

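You should increase the values above when there are a large number of consumers and consumer polling is enabled, though this mainly affects ZooKeeper-based consumer polling.

Kafka-managed consumer offsets are now consumed from the "__consumer_offsets" topic by KafkaManagedOffsetCache. Note that this has not been tested with a large number of offsets being tracked. There is a single thread per cluster consuming this topic, so it may not be able to keep up when a large number of offsets are pushed to the topic.

Deployment

The command below creates a zip file that can be used to deploy the application.

sbt clean dist

If you don't want to pull the source and build it yourself, I have built it and put it on Baidu Cloud:

https://pan.baidu.com/s/1geEB1rt

Starting the service

Unzip the zip file you just built, then start it: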

$ bin/kafka-manager

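By default the port is 9000. This can be overridden, as can the location of the configuration file, for example:

$ bin/kafka-manager -Dconfig.file=/path/to/application.conf -Dhttp.port=8080

If java is not on your PATH, or you need to run against a different version of java, add the -java-home option:

$ bin/kafka-manager -java-home /usr/local/oracle-java-8

Starting the service with security

To add a JAAS configuration for SASL, pass the location of the config file:

$ bin/kafka-manager -Djava.security.auth.login.config=/path/to/my-jaas.conf

Note: make sure the user running Kafka Manager has permission to read the JAAS config file.

Packaging

If you want to create a Debian or RPM package, you can run one of: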

sbt debian:packageBin
sbt rpm:packageBin

Operationalizing ZooKeeper for Kafka

Operationally, we do the following for a healthy ZooKeeper installation:

Redundancy in the physical/hardware/network layout: try not to put them all in the same rack, use decent (but don't go nuts) hardware, try to keep redundant power and network paths, etc.
I/O segregation: if you do a lot of write traffic, you'll almost definitely want the transaction logs on a different disk group than application logs and snapshots (a write to the ZooKeeper service includes a synchronous write to disk, which can be slow).
Application segregation: unless you really understand the application patterns of other apps that you want to install on the same box, it can be a good idea to run ZooKeeper in isolation (though this can be a balancing act with the capabilities of the hardware).
Use care with virtualization: it can work, depending on your cluster layout, read/write patterns and SLAs, but the tiny overheads introduced by the virtualization layer can add up and throw off ZooKeeper, as it can be very time-sensitive.
ZooKeeper configuration and monitoring: it's Java, so make sure you give it 'enough' heap space (we usually run it with 3-5 GB, but that's mostly due to the size of our data set). Unfortunately, we don't have a good formula for it. As far as monitoring goes, both JMX and the four-letter commands are very useful. They overlap in some cases, and there we prefer the four-letter commands: they seem more predictable, or at the very least they work better with LinkedIn's monitoring infrastructure.
Don't overbuild the cluster: large clusters, especially under a write-heavy usage pattern, mean a lot of intracluster communication (quorums on the writes and subsequent cluster member updates), but don't underbuild it either (and risk swamping the cluster).
Try to run on a 3-5 node cluster: ZooKeeper writes use quorums, which inherently means having an odd number of machines in a cluster. Remember that a 5-node cluster will make writes slower than a 3-node cluster, but will allow more fault tolerance.
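The quorum arithmetic behind the odd-number advice can be sketched as follows, assuming (as ZooKeeper does) that a write must be acknowledged by a strict majority of the ensemble. The function names are illustrative only.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of an n-node ensemble; each write must reach this many."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority is still available."""
    return n - quorum_size(n)


if __name__ == "__main__":
    for n in range(3, 7):
        print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

The output shows why even sizes are avoided: 4 nodes tolerate no more failures than 3, and 6 no more than 5, while every write has to wait for more acknowledgements.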

Overall, we try to keep the ZooKeeper system as small as will handle the load (plus standard growth capacity planning) and as simple as possible. We try not to do anything fancy with the configuration or application layout compared to the official release, and we keep it as self-contained as possible. For these reasons, we tend to skip the OS-packaged versions, since they tend to put things in the OS standard hierarchy, which can be 'messy', for want of a better way to word it.

Important Kafka client configurations

The most important producer configurations control:

compression
sync vs async production
batch size (for async production)

The most important consumer configuration is the fetch size.

A production server configuration

Here is our production server configuration:

# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000

controller.socket.timeout.ms=30000
controller.message.queue.size=10

# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=true
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=168
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

# Socket server configuration
num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
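Several of the values above are raw byte or hour counts. The small sketch below loads a few of them with java.util.Properties and converts them to readable units (assuming binary MiB), which makes it easier to see that retention is 7 days, segments are 1 GiB and the maximum request size is 100 MiB.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Converts a few size/time settings from the broker config above into readable units. */
final class BrokerConfigUnits {
    static final String SAMPLE = String.join("\n",
            "log.retention.hours=168",
            "log.segment.bytes=1073741824",
            "socket.request.max.bytes=104857600");

    static Properties load(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    static long retentionDays(Properties p) {
        return Long.parseLong(p.getProperty("log.retention.hours")) / 24;
    }

    static long mebibytes(Properties p, String key) {
        return Long.parseLong(p.getProperty(key)) / (1024L * 1024L);
    }

    public static void main(String[] args) throws IOException {
        Properties p = load(SAMPLE);
        System.out.println("retention days  = " + retentionDays(p));
        System.out.println("segment MiB     = " + mebibytes(p, "log.segment.bytes"));
        System.out.println("max request MiB = " + mebibytes(p, "socket.request.max.bytes"));
    }
}
```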

Our client configuration varies a fair amount between different use cases.

Java version for Kafka

From a security perspective, we recommend you use the latest released version of JDK 1.8, as older freely available versions have disclosed security vulnerabilities. LinkedIn is currently running JDK 1.8 u5 (looking to upgrade to a newer version) with the G1 collector. If you decide to use the G1 collector (the current default) and you are still on JDK 1.7, make sure you are on u51 or newer. LinkedIn tried out u21 in testing, but they had a number of problems with the GC implementation in that version. LinkedIn's tuning looks like this:

-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
  -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

For reference, here are the stats on one of LinkedIn's busiest clusters (at peak):

- 60 brokers
- 50k partitions (replication factor 2)
- 800k messages/sec in
- 300 MB/sec inbound, 1 GB/sec+ outbound
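A few derived figures help put these numbers in proportion. The sketch assumes MB means 10^6 bytes and that replicas are spread evenly across brokers; both are simplifications, not facts about LinkedIn's cluster.

```java
/** Back-of-the-envelope figures derived from the peak stats listed above. */
final class ClusterStats {
    static double avgMessageBytes(double bytesPerSec, double messagesPerSec) {
        return bytesPerSec / messagesPerSec;
    }

    static double replicasPerBroker(long partitions, int replicationFactor, int brokers) {
        return (double) partitions * replicationFactor / brokers;
    }

    public static void main(String[] args) {
        // Assumes MB = 10^6 bytes and GB = 10^9 bytes.
        System.out.println("avg message bytes: " + avgMessageBytes(300e6, 800e3));
        System.out.println("replicas/broker:   " + replicasPerBroker(50_000, 2, 60));
        System.out.println("out/in ratio:      " + (1e9 / 300e6));
    }
}
```

That works out to roughly 375-byte messages, about 1,667 partition replicas per broker, and more than three bytes sent out for every byte received, which is consistent with each message being read by several consumers plus replication.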

The tuning looks fairly aggressive, but all of the brokers in that cluster have a 90% GC pause time of about 21ms, and they're doing less than 1 young GC per second.

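Kafka security overview

7.1 Security Overview

In release 0.9.0.0, Kafka added a number of security features. The following security measures are currently supported; they can be used together or separately:

Authentication of connections to brokers from clients (producers and consumers), other brokers, and tools, using either SSL or SASL. Kafka supports the following SASL mechanisms:
- SASL/GSSAPI (Kerberos) - starting at version 0.9.0.0
- SASL/PLAIN - starting at version 0.10.0.0
- SASL/SCRAM-SHA-256 and SASL/SCRAM-SHA-512 - starting at version 0.10.2.0
- SASL/OAUTHBEARER - starting at version 2.0
Authentication of connections from brokers to ZooKeeper.
Encryption of data transferred between brokers and clients, between brokers, or between brokers and tools using SSL (note that performance degrades when SSL is enabled; the magnitude depends on the CPU type and the JVM implementation).
Authorization of read/write operations by clients.
Authorization is pluggable, and integration with external authorization services is supported.

It's worth noting that security is optional: non-secured clusters are supported, as well as a mix of authenticated, unauthenticated, encrypted, and non-encrypted clients. The guides below explain how to configure and use the security features in both clients and brokers.

Note:

If your grounding in encryption basics is shaky, you may want to read this article first: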

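Encryption and authentication using SSL in Kafka

Apache Kafka allows clients to connect over SSL. SSL is disabled by default and must be enabled manually.

1. Generate an SSL key and certificate for each Kafka broker

The first step of deploying HTTPS is to generate the key and the certificate for each machine in the cluster. You can use Java's keytool utility for this. We generate the key into a temporary keystore first, so that we can export it and sign it with a CA later.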

keytool -keystore server.keystore.jks -alias localhost -validity {validity} -genkey

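You need to specify two parameters in the above command:

keystore: the keystore file that stores the certificate. The keystore file contains the private key of the certificate, so it must be kept safe.
validity: the validity period of the certificate, in days.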

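Configuring host name verification

Starting with Kafka 2.0.0, host name verification of servers is enabled by default for client connections as well as inter-broker connections, to prevent man-in-the-middle attacks. Server host name verification can be disabled by setting ssl.endpoint.identification.algorithm to an empty string. For example: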

ssl.endpoint.identification.algorithm=

For dynamically configured broker listeners, host name verification can be disabled using kafka-configs.sh. For example:

bin/kafka-configs.sh --bootstrap-server localhost:9093 --entity-type brokers --entity-name 0 --alter --add-config "listener.name.internal.ssl.endpoint.identification.algorithm="

For older versions of Kafka, ssl.endpoint.identification.algorithm is not defined by default, so host name verification is not performed. Set it to HTTPS to enable host name verification:

ssl.endpoint.identification.algorithm=HTTPS

If servers are not validated externally, host name verification must be enabled to prevent man-in-the-middle attacks.
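Under the hood, host name verification is a JSSE feature. The sketch below is plain JDK code, not Kafka code: it assumes that Kafka's setting corresponds to the JSSE endpoint identification algorithm, and shows how "HTTPS" turns the check on for a client-side SSLEngine, while null (roughly mirroring an empty ssl.endpoint.identification.algorithm) turns it off.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

/**
 * JSSE-level illustration of endpoint identification: with "HTTPS", the handshake
 * checks the peer certificate's SAN/CN against the host the engine was created for.
 */
final class HostnameVerification {
    static SSLEngine clientEngine(String host, int port, boolean verifyHostname)
            throws NoSuchAlgorithmException {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine(host, port);
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        // null disables the check; "HTTPS" enables RFC 2818-style host name matching
        params.setEndpointIdentificationAlgorithm(verifyHostname ? "HTTPS" : null);
        engine.setSSLParameters(params);
        return engine;
    }
}
```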

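Configuring host names in certificates

If host name verification is enabled, clients will verify the server's fully qualified domain name (FQDN) against one of the following two fields: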

Common Name (CN)
Subject Alternative Name (SAN)

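Both fields are valid, but RFC 2818 recommends the use of SAN. SAN is more flexible and allows multiple DNS entries to be declared. Another advantage is that the CN can then be set to a more meaningful value for authorization purposes. To add a SAN, append the parameter -ext SAN=DNS:{FQDN} to the keytool command: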

keytool -keystore server.keystore.jks -alias localhost -validity {validity} -genkey -keyalg RSA -ext SAN=DNS:{FQDN}

You can then run the following command to verify the contents of the generated certificate:

keytool -list -v -keystore server.keystore.jks
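If you want to perform a similar check from Java code (for example in a deployment test), the JDK KeyStore API can list the entries. This sketch only lists aliases and entry types, roughly what keytool -list shows; the keystore path and password are passed as arguments and are placeholders.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Lists the aliases in a JKS keystore, roughly what `keytool -list` shows. */
final class KeystoreInspector {
    static List<String> aliases(KeyStore ks) throws GeneralSecurityException {
        List<String> out = new ArrayList<>(Collections.list(ks.aliases()));
        Collections.sort(out);
        return out;
    }

    static KeyStore loadJks(String path, char[] password)
            throws IOException, GeneralSecurityException {
        KeyStore ks = KeyStore.getInstance("JKS");
        try (InputStream in = new FileInputStream(path)) {
            ks.load(in, password);
        }
        return ks;
    }

    public static void main(String[] args) throws Exception {
        // usage: java KeystoreInspector server.keystore.jks <password>
        KeyStore ks = loadJks(args[0], args[1].toCharArray());
        for (String alias : aliases(ks)) {
            System.out.println(alias + (ks.isKeyEntry(alias) ? " (private key)" : " (trusted cert)"));
        }
    }
}
```

After the signing steps below, server.keystore.jks would be expected to show a private-key entry for localhost and a trusted certificate for CARoot.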

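2. Creating your own CA

After the first step, each machine in the cluster has a public-private key pair and a certificate that identifies the machine. The certificate, however, is unsigned, which means that an attacker can create such a certificate to pretend to be any machine.

Therefore, it is important to prevent forged certificates by signing them for each machine in the cluster. A certificate authority (CA) is responsible for signing certificates. A CA works like a government that issues passports: the government stamps (signs) each passport so that the passport becomes difficult to forge, and other governments verify the stamps to ensure the passport is authentic. Similarly, the CA signs the certificates, and the cryptography guarantees that a signed certificate is computationally difficult to forge. Thus, as long as the CA is a genuine and trusted authority, clients have high assurance that they are connecting to the authentic machines.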

openssl req -new -x509 -keyout ca-key -out ca-cert -days 365

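The generated CA is simply a public-private key pair and certificate, and it is intended to sign other certificates.

The next step is to add the generated CA to the **clients' truststore** so that the clients can trust this CA: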

keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert

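Note: if you configure the Kafka brokers to require client authentication by setting ssl.client.auth to "requested" or "required" in the broker configuration, you must also provide a truststore for the Kafka brokers, and it should contain all the CA certificates that the clients' keys were signed by: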

keytool -keystore server.truststore.jks -alias CARoot -import -file ca-cert

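In contrast to the keystore in step 1, which stores each machine's own identity, the truststore of a client stores all the certificates that the client should trust. Importing a certificate into one's truststore also means trusting all certificates that are signed by that certificate. As in the analogy above, trusting the government (CA) also means trusting all passports (certificates) that it has issued. This attribute is called the chain of trust, and it is particularly useful when deploying SSL on a large Kafka cluster: you can sign all certificates in the cluster with a single CA and have all machines share the same truststore that trusts the CA. That way, every machine can authenticate every other machine.

3. Signing the certificate

The next step is to sign all certificates generated in step 1 with the CA generated in step 2. First, you need to export the certificate from the keystore: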

keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file

Then sign it with the CA:

openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days {validity} -CAcreateserial -passin pass:{ca-password}

Finally, you need to import both the certificate of the CA and the signed certificate into the keystore:

keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed

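The parameters are defined as follows:

keystore: the location of the keystore
ca-cert: the certificate of the CA
ca-key: the private key of the CA
ca-password: the passphrase of the CA
cert-file: the exported, unsigned certificate of the server
cert-signed: the signed certificate of the server

Here is an example bash script with all of the above steps. Note that one of the commands assumes a CA passphrase of test1234, so either use that passphrase or edit the command before running it.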

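#!/bin/bash
# Abort on the first failing command.
set -e
# Step 1: generate the broker key pair and (unsigned) certificate in a keystore
keytool -keystore server.keystore.jks -alias localhost -validity 365 -genkey -keyalg RSA
# Step 2: create the CA, then import its certificate into the server and client truststores
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365
keytool -keystore server.truststore.jks -alias CARoot -import -file ca-cert
keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert
# Step 3: export a signing request, sign it with the CA (CA passphrase test1234),
# then import the CA certificate and the signed certificate into the keystore
keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial -passin pass:test1234
keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed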

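4. Configuring Kafka brokers

Kafka brokers support listening for connections on multiple ports. This is configured with the following property in server.properties, which must have one or more comma-separated values: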

listeners

If SSL is not enabled for inter-broker communication (see below for how to enable it), both PLAINTEXT and SSL ports will be necessary.

listeners=PLAINTEXT://host.name:port,SSL://host.name:port
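To make the shape of this value concrete, here is a hypothetical helper (not part of Kafka) that splits a listeners string into protocol and port. It assumes the simple PROTOCOL://host:port form shown above and does not handle IPv6 literals or custom listener names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for "PLAINTEXT://host:9092,SSL://host:9093" style values. */
final class Listeners {
    static Map<String, Integer> ports(String listeners) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String entry : listeners.split(",")) {
            String trimmed = entry.trim();
            int sep = trimmed.indexOf("://");
            if (sep < 0) throw new IllegalArgumentException("missing '://': " + trimmed);
            String protocol = trimmed.substring(0, sep);
            String hostPort = trimmed.substring(sep + 3);
            int colon = hostPort.lastIndexOf(':');
            if (colon < 0) throw new IllegalArgumentException("missing port: " + trimmed);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // each listener name may appear only once
            if (out.put(protocol, port) != null) {
                throw new IllegalArgumentException("duplicate listener: " + protocol);
            }
        }
        return out;
    }
}
```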

The following SSL configs are needed on the broker side:

ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234

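Note: ssl.truststore.password is technically optional but highly recommended. If a password is not set, access to the truststore is still available, but integrity checking is disabled.

Optional settings that are worth considering:

ssl.client.auth=none ("required" => client authentication is required; "requested" => client authentication is requested, but a client without a certificate can still connect. The use of "requested" is discouraged, as it provides a false sense of security, and misconfigured clients will still connect successfully.)
ssl.cipher.suites (optional). A cipher suite is a named combination of authentication, encryption, MAC, and key exchange algorithms used to negotiate the security settings for a network connection using the TLS or SSL network protocol. (Default is an empty list.)
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 (the SSL protocols that you are going to accept from clients. Note that SSL is deprecated in favor of TLS, and using SSL in production is not recommended.)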
ssl.keystore.type=JKS
ssl.truststore.type=JKS
ssl.secure.random.implementation=SHA1PRNG

If you want to enable SSL for inter-broker communication, add the following to the broker properties file (it defaults to PLAINTEXT):

security.inter.broker.protocol=SSL

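Due to import regulations in some countries, the Oracle implementation limits the strength of cryptographic algorithms available by default. If stronger algorithms are needed (for example, AES with 256-bit keys), the JCE Unlimited Strength Jurisdiction Policy Files must be obtained and installed in the JDK/JRE. See the JCA Providers documentation for more information.

The JRE/JDK has a default pseudo-random number generator (PRNG) that is used for cryptographic operations, so it is not required to configure the implementation used with ssl.secure.random.implementation. However, there are performance issues with some implementations (notably, the default chosen on Linux systems, NativePRNG, uses a global lock). If the performance of SSL connections becomes an issue, consider explicitly setting the implementation to be used. The SHA1PRNG implementation is non-blocking, and has shown very good performance characteristics under heavy load (50 MB/sec of produced messages, plus replication traffic, per broker).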
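For orientation, these are the JDK calls involved when a PRNG is chosen by name. The sketch assumes that ssl.secure.random.implementation is ultimately handed to SecureRandom.getInstance; it compares the platform default (which varies by OS and JDK version) with an explicitly requested SHA1PRNG.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Default SecureRandom versus an explicitly requested implementation. */
final class PrngChoice {
    static SecureRandom explicit(String algorithm) throws NoSuchAlgorithmException {
        return SecureRandom.getInstance(algorithm);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Platform-dependent, e.g. NativePRNG on many Linux JDKs.
        System.out.println("default:  " + new SecureRandom().getAlgorithm());
        SecureRandom sha1 = explicit("SHA1PRNG");
        byte[] buf = new byte[16];
        sha1.nextBytes(buf); // does not block on the OS entropy source
        System.out.println("explicit: " + sha1.getAlgorithm());
    }
}
```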

Once you start the broker, you should be able to see the following in server.log:

with addresses: PLAINTEXT -> EndPoint(192.168.64.1,9092,PLAINTEXT),SSL -> EndPoint(192.168.64.1,9093,SSL)

To check quickly whether the server keystore and truststore are set up properly, run the following command:

openssl s_client -debug -connect localhost:9093 -tls1

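(Note: TLSv1 should be listed under ssl.enabled.protocols.)
In the output of this command you should see the server's certificate:

If the certificate does not show up, or if there are any other error messages, then your keystore is not set up properly.

5. Configuring Kafka clients

SSL is supported only for the new Kafka producer and consumer; the older API is not supported. The SSL configs are the same for both the producer and the consumer.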

If client authentication is not required by the broker, the following is a minimal configuration example:

security.protocol=SSL
ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=test1234

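Note: ssl.truststore.password is technically optional but highly recommended. If a password is not set, access to the truststore is still available, but integrity checking is disabled. If client authentication is required, a keystore must be created as in step 1, and the following must also be configured: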

ssl.keystore.location=/var/private/ssl/client.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
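When clients are configured in code rather than from a properties file, the same settings can be assembled programmatically. This sketch uses plain java.util.Properties; the paths and passwords are the placeholder values above, and passing a null keystore path produces the minimal configuration for brokers that do not require client authentication.

```java
import java.util.Properties;

/** Builds the client-side SSL settings shown above (placeholder paths and passwords). */
final class ClientSslConfig {
    static Properties build(String truststorePath, String truststorePassword,
                            String keystorePath, String keystorePassword, String keyPassword) {
        Properties p = new Properties();
        p.setProperty("security.protocol", "SSL");
        p.setProperty("ssl.truststore.location", truststorePath);
        p.setProperty("ssl.truststore.password", truststorePassword);
        if (keystorePath != null) {
            // Only needed when the broker sets ssl.client.auth to "required" (or "requested").
            p.setProperty("ssl.keystore.location", keystorePath);
            p.setProperty("ssl.keystore.password", keystorePassword);
            p.setProperty("ssl.key.password", keyPassword);
        }
        return p;
    }
}
```

The resulting Properties object can be passed to the KafkaProducer or KafkaConsumer constructor together with the usual bootstrap.servers and serializer settings.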

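Other settings may also be needed, depending on your requirements and the broker configuration:

ssl.provider (optional). The name of the security provider used for SSL connections. The default value is the default security provider of the JVM.
ssl.cipher.suites (optional). A cipher suite is a named combination of authentication, encryption, MAC, and key exchange algorithms used to negotiate the security settings for a network connection using the TLS or SSL network protocol.
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1. It should list at least one of the protocols configured on the broker side.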
ssl.truststore.type=JKS
ssl.keystore.type=JKS

Examples using the console producer and console consumer:

kafka-console-producer.sh --broker-list localhost:9093 --topic test --producer.config client-ssl.properties
kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --consumer.config client-ssl.properties

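Hands-on notes

Kafka SSL in practice

Kafka SASL authentication

1. JAAS configuration

Kafka uses the Java Authentication and Authorization Service (JAAS) for SASL configuration.

JAAS configuration for Kafka brokers

KafkaServer is the section name in the JAAS file used by each KafkaServer/broker. This section provides SASL configuration options for the broker, including any SASL client connections made by the broker for inter-broker communication. If multiple listeners are configured to use SASL, the section name may be prefixed with the listener name in lowercase followed by a period, e.g. sasl_ssl.KafkaServer.

The Client section is used to authenticate a SASL connection with ZooKeeper. It also allows the brokers to set SASL ACLs on ZooKeeper nodes, which locks these nodes down so that only the brokers can modify them. All brokers must use the same principal name. If you want to use a section name other than Client, set the system property zookeeper.sasl.clientconfig to the appropriate name (e.g., -Dzookeeper.sasl.clientconfig=ZkClient).

By default, ZooKeeper uses "zookeeper" as the service name. If you want to change this, set the system property zookeeper.sasl.client.username to the appropriate name (e.g., -Dzookeeper.sasl.client.username=zk).

Brokers may also configure JAAS using the broker configuration property sasl.jaas.config. The property name must be prefixed with the listener prefix including the SASL mechanism, i.e. listener.name.{listenerName}.{saslMechanism}.sasl.jaas.config. Only one login module may be specified in the config value. If multiple mechanisms are configured on a listener, a config must be provided for each mechanism using the listener and mechanism prefix. For example: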
```
listener.name.sasl_ssl.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
   username="admin" \
   password="admin-secret";
listener.name.sasl_ssl.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
   username="admin" \
   password="admin-secret" \
   user_admin="admin-secret" \
   user_alice="alice-secret";
```
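If JAAS configuration is defined at different levels, the order of precedence used is:
- Broker configuration property listener.name.{listenerName}.{saslMechanism}.sasl.jaas.config
- {listenerName}.KafkaServer section of the static JAAS configuration
- KafkaServer section of the static JAAS configuration
Note that the ZooKeeper JAAS config may only be configured using static JAAS configuration.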
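The naming rule and the precedence order can be captured in a few lines. This is a hypothetical helper, not Kafka code: it does not read real JAAS files, and instead takes the static sections as a map from section name to login-module text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of the listener-prefixed property name and the precedence order above. */
final class JaasResolution {
    static String listenerKey(String listenerName, String saslMechanism) {
        return "listener.name." + listenerName.toLowerCase(Locale.ROOT) + "."
                + saslMechanism.toLowerCase(Locale.ROOT) + ".sasl.jaas.config";
    }

    static Optional<String> resolve(Map<String, String> brokerProps,
                                    Map<String, String> staticSections,
                                    String listenerName, String saslMechanism) {
        // 1. broker property listener.name.{listenerName}.{saslMechanism}.sasl.jaas.config
        String fromProps = brokerProps.get(listenerKey(listenerName, saslMechanism));
        if (fromProps != null) return Optional.of(fromProps);
        // 2. static section {listenerName}.KafkaServer (listener name in lowercase)
        String prefixed = staticSections.get(listenerName.toLowerCase(Locale.ROOT) + ".KafkaServer");
        if (prefixed != null) return Optional.of(prefixed);
        // 3. static section KafkaServer
        return Optional.ofNullable(staticSections.get("KafkaServer"));
    }
}
```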

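See GSSAPI (Kerberos), PLAIN, SCRAM, or OAUTHBEARER for example broker configurations.
JAAS configuration for Kafka clients

Clients may configure JAAS using the client configuration property sasl.jaas.config, or using a static JAAS config file similar to brokers.
1. JAAS configuration using the client configuration property
  
  Clients may specify the JAAS configuration as a producer or consumer property without creating a physical configuration file. This mode also enables different producers and consumers within the same JVM to use different credentials, by specifying different properties for each client. If both the static JAAS configuration system property java.security.auth.login.config and the client property sasl.jaas.config are specified, the client property is used.
  
  See GSSAPI (Kerberos), PLAIN, SCRAM, or OAUTHBEARER for example configurations.
2. JAAS configuration using a static config file
  
  To configure SASL authentication on the clients using a static JAAS config file:
  1. Add a JAAS config file with a client login section named KafkaClient. Configure a login module in KafkaClient for the selected mechanism, as described in the examples for setting up GSSAPI (Kerberos), PLAIN, or SCRAM. For example, GSSAPI credentials may be configured as: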
```
KafkaClient {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 storeKey=true
 keyTab="/etc/security/keytabs/kafka_client.keytab"
 principal="kafka-client-1@EXAMPLE.COM";
};
```
  2. Pass the JAAS config file location as a JVM parameter to each client JVM. For example:
```
-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
```

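2. SASL configuration

SASL may be used with PLAINTEXT or SSL as the transport layer, using the security protocol SASL_PLAINTEXT or SASL_SSL respectively. If SASL_SSL is used, SSL must also be configured.

SASL mechanisms

Kafka supports the following SASL mechanisms:
- GSSAPI (Kerberos)
- PLAIN
- SCRAM-SHA-256
- SCRAM-SHA-512
- OAUTHBEARER
SASL configuration for Kafka brokers
1. Configure a SASL port in server.properties by adding at least one of SASL_PLAINTEXT or SASL_SSL to the listeners parameter, which contains one or more comma-separated values: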
```
listeners=SASL_PLAINTEXT://host.name:port
```
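  If you are only configuring a SASL port (or if you want the Kafka brokers to authenticate each other using SASL), make sure you set the same SASL protocol for inter-broker communication: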
  
  security.inter.broker.protocol=SASL_PLAINTEXT (or SASL_SSL)
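2. Select one or more of the supported mechanisms to enable in the broker, and follow the steps to configure SASL for each of them. To enable multiple mechanisms in the broker, see the section on enabling multiple SASL mechanisms in a broker.
SASL configuration for Kafka clients

SASL authentication is only supported for the new Java Kafka producer and consumer; the older API is not supported.

To configure SASL authentication on the clients, select a SASL mechanism that is enabled in the broker for client authentication, and follow the steps to configure SASL for the selected mechanism.

Getting started with SASL authentication

Authentication using SASL/Kerberos

1. Prerequisites

Kerberos

If your organization is already using a Kerberos server (for example, through Active Directory), there is no need to install a new server just for Kafka. Otherwise, you will need to install one; your Linux vendor likely has packages for Kerberos and a short guide on how to install and configure it (Ubuntu, Redhat). Note that if you are using Oracle Java, you will need to download the JCE policy files for your Java version and copy them to $JAVA_HOME/jre/lib/security (they must replace the existing files).
Create Kerberos principals

If you are using your organization's Kerberos or Active Directory server, ask your Kerberos administrator for a principal for each Kafka broker in your cluster and for every operating system user that will access Kafka with Kerberos authentication (via clients and tools).

If you have installed your own Kerberos, you will need to create these principals yourself using the following commands: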
```
sudo /usr/sbin/kadmin.local -q 'addprinc -randkey kafka/{hostname}@{REALM}'
sudo /usr/sbin/kadmin.local -q "ktadd -k /etc/security/keytabs/{keytabname}.keytab kafka/{hostname}@{REALM}"
```
Make sure all hosts can be reached using their hostnames: Kerberos requires that all your hosts can be resolved with their FQDNs.
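A quick way to see how a machine resolves its own name before creating its kafka/{hostname}@{REALM} principal is to ask the JVM, since the broker resolves names through the same resolver. This sketch only reports; a canonical name without a dot, or an IP literal (returned when reverse lookup fails), suggests the FQDN setup needs attention.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Prints how this host resolves, as a sanity check before creating Kerberos principals. */
final class FqdnCheck {
    static String canonicalName(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        String fqdn = local.getCanonicalHostName();
        System.out.println("hostname:  " + local.getHostName());
        System.out.println("canonical: " + fqdn);
        if (!fqdn.contains(".")) {
            System.out.println("warning: canonical name does not look fully qualified");
        }
    }
}
```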

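2. Configuring Kafka brokers

Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory; let's call it kafka_server_jaas.conf for this example (note that each broker should have its own keytab):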

KafkaServer {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka_server.keytab"
    principal="kafka/kafka1.hostname.com@EXAMPLE.COM";
};

// Zookeeper client authentication
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka_server.keytab"
    principal="kafka/kafka1.hostname.com@EXAMPLE.COM";
};

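The KafkaServer section in the JAAS file tells the broker which principal to use and the location of the keytab where this principal is stored. It allows the broker to log in using the keytab specified in this section.

Pass the JAAS and, optionally, the krb5 file locations as JVM parameters to each Kafka broker:

    -Djava.security.krb5.conf=/etc/kafka/krb5.conf
    -Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf

Make sure the keytabs configured in the JAAS file are readable by the operating system user who is starting the Kafka broker.
Configure the SASL port and SASL mechanisms in server.properties. For example: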
```
 listeners=SASL_PLAINTEXT://host.name:port
 security.inter.broker.protocol=SASL_PLAINTEXT
 sasl.mechanism.inter.broker.protocol=GSSAPI
 sasl.enabled.mechanisms=GSSAPI
```
We must also configure the service name in server.properties, which should match the principal name of the Kafka brokers. In the above example, the principal is "kafka/kafka1.hostname.com@EXAMPLE.COM", so:
```
sasl.kerberos.service.name=kafka
```
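The service name is the primary component of the broker principal (the part before the first "/"). The hypothetical helper below splits a principal of the form primary/instance@REALM so a deployment script could check that sasl.kerberos.service.name matches; it ignores escaped "/" or "@" characters.

```java
/** Hypothetical splitter for Kerberos principals of the form primary[/instance][@REALM]. */
final class KerberosPrincipal {
    final String primary;
    final String instance; // null for user principals such as kafka-client-1@EXAMPLE.COM
    final String realm;    // null when no realm is given

    private KerberosPrincipal(String primary, String instance, String realm) {
        this.primary = primary;
        this.instance = instance;
        this.realm = realm;
    }

    static KerberosPrincipal parse(String principal) {
        int at = principal.lastIndexOf('@');
        String name = at >= 0 ? principal.substring(0, at) : principal;
        String realm = at >= 0 ? principal.substring(at + 1) : null;
        int slash = name.indexOf('/');
        String primary = slash >= 0 ? name.substring(0, slash) : name;
        String instance = slash >= 0 ? name.substring(slash + 1) : null;
        return new KerberosPrincipal(primary, instance, realm);
    }
}
```

For "kafka/kafka1.hostname.com@EXAMPLE.COM" the primary is "kafka", which is the value used for sasl.kerberos.service.name above.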

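3. Configuring Kafka clients

To configure SASL authentication on the clients:

Clients (producers, consumers, Connect workers, etc.) authenticate to the cluster with their own principal (usually with the same name as the user running the client), so obtain or create these principals as needed. Then configure the JAAS configuration property for each client. Different clients within a JVM may run as different users by specifying different principals. The property sasl.jaas.config in producer.properties or consumer.properties describes how clients such as producers and consumers connect to the Kafka broker. The following is an example configuration for a client using a keytab (recommended for long-running processes):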
```
 sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
     useKeyTab=true \
     storeKey=true  \
     keyTab="/etc/security/keytabs/kafka_client.keytab" \
     principal="kafka-client-1@EXAMPLE.COM";
```
For command-line utilities such as kafka-console-consumer or kafka-console-producer, kinit can be used along with "useTicketCache=true", as in:
```
 sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
     useTicketCache=true;
```
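The JAAS configuration for clients may alternatively be specified as a JVM parameter, similar to brokers. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.
Make sure the keytabs configured in the JAAS configuration are readable by the operating system user who is starting the Kafka client.
Optionally, pass the krb5 file location as a JVM parameter to each client JVM: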
```
 -Djava.security.krb5.conf=/etc/kafka/krb5.conf
```

Configure the following properties in producer.properties or consumer.properties:

 security.protocol=SASL_PLAINTEXT (or SASL_SSL)
 sasl.mechanism=GSSAPI
 sasl.kerberos.service.name=kafka
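For clients created in code, the keytab-based settings of this section can be assembled as properties. The keytab path and principal are the example values above, not real ones; when SASL_SSL is selected, the truststore settings from the SSL section are needed as well, and this sketch does not add them.

```java
import java.util.Properties;

/** Kerberos (GSSAPI) client settings from this section, built as plain Properties. */
final class GssapiClientConfig {
    static Properties withKeytab(String keytab, String principal, boolean useSsl) {
        Properties p = new Properties();
        p.setProperty("security.protocol", useSsl ? "SASL_SSL" : "SASL_PLAINTEXT");
        p.setProperty("sasl.mechanism", "GSSAPI");
        p.setProperty("sasl.kerberos.service.name", "kafka"); // primary of the broker principal
        p.setProperty("sasl.jaas.config",
                "com.sun.security.auth.module.Krb5LoginModule required"
                + " useKeyTab=true storeKey=true"
                + " keyTab=\"" + keytab + "\""
                + " principal=\"" + principal + "\";");
        return p;
    }
}
```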

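Hands-on notes

Kafka Kerberos in practice (notes)

Authentication using SASL/PLAIN

SASL/PLAIN is a simple username/password authentication mechanism that is typically used with TLS for encryption to implement secure authentication. Kafka supports a default implementation of SASL/PLAIN, which can be extended for production use.

The username is used as the authenticated principal for configuring ACLs, etc.

1. Configuring Kafka brokers

Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory; let's call it kafka_server_jaas.conf for this example: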
```
KafkaServer {
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="admin"
    password="admin-secret"
    user_admin="admin-secret"
    user_alice="alice-secret";
};
```
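This configuration defines two users (admin and alice). The properties username and password in the KafkaServer section are used by the broker to initiate connections to other brokers; in this example, admin is the user for inter-broker communication. The set of properties user_{userName} defines the passwords for all users that connect to the broker, and the broker validates all client connections, including those from other brokers, using these properties.
Pass the JAAS config file location as a JVM parameter to each Kafka broker: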
```
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
```

Configure the SASL port and SASL mechanism in server.properties. For example:

listeners=SASL_SSL://host.name:port
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN

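2. Configuring Kafka clients

To configure SASL authentication on the clients:

Configure the JAAS configuration property for each client in producer.properties or consumer.properties. The login module describes how clients such as producers and consumers connect to the Kafka broker. The following is an example configuration for a client using the PLAIN mechanism: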
```
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="alice" \
    password="alice-secret";
```
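The options username and password are used by clients to configure the user for client connections. In this example, clients connect to the broker as user alice. Different clients within a JVM may connect as different users by specifying different usernames and passwords in sasl.jaas.config.

The JAAS configuration for clients may alternatively be specified as a JVM parameter, similar to brokers. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.
Configure the following properties in producer.properties or consumer.properties: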
```
 security.protocol=SASL_SSL
 sasl.mechanism=PLAIN
```
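The same PLAIN client settings can be built in code, which is convenient when different clients in one JVM need different users. This is a sketch with plain java.util.Properties; it deliberately rejects double quotes in credentials instead of attempting JAAS escaping, which is a simplification of this example rather than a Kafka rule.

```java
import java.util.Properties;

/** SASL/PLAIN client settings from this section, built as plain Properties. */
final class PlainClientConfig {
    static Properties build(String username, String password) {
        if (username.contains("\"") || password.contains("\"")) {
            // This sketch does not escape quotes inside the JAAS value.
            throw new IllegalArgumentException("double quotes are not supported in this sketch");
        }
        Properties p = new Properties();
        p.setProperty("security.protocol", "SASL_SSL"); // PLAIN should only run over TLS
        p.setProperty("sasl.mechanism", "PLAIN");
        p.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required"
                + " username=\"" + username + "\""
                + " password=\"" + password + "\";");
        return p;
    }
}
```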

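3. Using SASL/PLAIN in production

SASL/PLAIN should be used only with SSL as the transport layer, to ensure that clear passwords are not transmitted on the wire without encryption.
The default implementation of SASL/PLAIN in Kafka specifies usernames and passwords in the JAAS configuration file, as shown above. From Kafka version 2.0 onwards, you can avoid storing clear passwords on disk by configuring your own callback handlers that obtain the username and password from an external source, using the configuration options sasl.server.callback.handler.class and sasl.client.callback.handler.class.
In production systems, external authentication servers may implement password authentication. From Kafka version 2.0 onwards, you can plug in your own callback handler that uses an external authentication server for password verification by configuring sasl.server.callback.handler.class.

Hands-on notes

Kafka SASL/PLAIN authentication in practice

Authentication using SASL/SCRAM

Salted Challenge Response Authentication Mechanism (SCRAM) is a family of SASL mechanisms that addresses the security concerns of traditional mechanisms that perform username/password authentication, such as PLAIN and DIGEST-MD5. The mechanism is defined in RFC 5802. Kafka supports SCRAM-SHA-256 and SCRAM-SHA-512, which can be used with TLS to perform secure authentication. The username is used as the authenticated principal for configuring ACLs, etc. The default SCRAM implementation in Kafka stores SCRAM credentials in ZooKeeper and is suitable for Kafka installations where ZooKeeper is on a private network. Refer to the security considerations below for more details.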

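1. Creating SCRAM credentials

The SCRAM implementation in Kafka uses ZooKeeper as its credential store. Credentials can be created in ZooKeeper using kafka-configs.sh. For each SCRAM mechanism enabled, credentials must be created by adding a config with the mechanism name. Credentials for inter-broker communication must be created before the Kafka brokers are started. Client credentials may be created and updated dynamically, and updated credentials will be used to authenticate new connections.

Create SCRAM credentials for user alice with password alice-secret: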

> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=alice-secret],SCRAM-SHA-512=[password=alice-secret]' --entity-type users --entity-name alice

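The default iteration count of 4096 is used if iterations are not specified. A random salt is created, and the SCRAM identity, consisting of the salt, iterations, StoredKey, and ServerKey, is stored in ZooKeeper. See RFC 5802 for details on the SCRAM identity and the individual fields.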
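The stored fields follow directly from the RFC 5802 definitions: SaltedPassword = Hi(password, salt, i), which is PBKDF2 with HMAC as the PRF; ClientKey = HMAC(SaltedPassword, "Client Key"); StoredKey = H(ClientKey); ServerKey = HMAC(SaltedPassword, "Server Key"). The sketch below computes them for SCRAM-SHA-256 with JDK primitives. It assumes an ASCII password (full SCRAM also requires SASLprep normalization) and is not Kafka's own implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

/** SCRAM-SHA-256 credential derivation following RFC 5802 (ASCII passwords only). */
final class ScramCredential {
    final byte[] salt;
    final int iterations;
    final byte[] storedKey;
    final byte[] serverKey;

    private ScramCredential(byte[] salt, int iterations, byte[] storedKey, byte[] serverKey) {
        this.salt = salt;
        this.iterations = iterations;
        this.storedKey = storedKey;
        this.serverKey = serverKey;
    }

    static ScramCredential derive(String password, byte[] salt, int iterations)
            throws GeneralSecurityException {
        if (iterations < 4096) {
            throw new IllegalArgumentException("Kafka requires at least 4096 iterations");
        }
        // SaltedPassword = Hi(password, salt, i), i.e. PBKDF2 with HMAC-SHA-256, 32-byte output
        PBEKeySpec spec = new PBEKeySpec(password.toCharArray(), salt, iterations, 256);
        byte[] saltedPassword = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        spec.clearPassword();
        byte[] clientKey = hmac(saltedPassword, "Client Key");
        byte[] storedKey = MessageDigest.getInstance("SHA-256").digest(clientKey);
        byte[] serverKey = hmac(saltedPassword, "Server Key");
        return new ScramCredential(salt.clone(), iterations, storedKey, serverKey);
    }

    static byte[] hmac(byte[] key, String message) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
    }

    static byte[] randomSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }
}
```

Only the salt, iteration count, StoredKey and ServerKey need to be stored; the password and ClientKey can be discarded, which is why a leaked credential store does not directly reveal passwords.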

The following examples also require a user admin for inter-broker communication, which can be created using:

bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[password=admin-secret],SCRAM-SHA-512=[password=admin-secret]' --entity-type users --entity-name admin

Existing credentials may be listed using the --describe option:

bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users --entity-name alice

Credentials may be deleted for one or more SCRAM mechanisms using the --delete-config option:

bin/kafka-configs.sh --zookeeper localhost:2181 --alter --delete-config 'SCRAM-SHA-512' --entity-type users --entity-name alice

2. Configuring Kafka brokers

Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory; let's call it kafka_server_jaas.conf for this example:
```
KafkaServer {
 org.apache.kafka.common.security.scram.ScramLoginModule required
 username="admin"
 password="admin-secret"
};
```
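The properties username and password in the KafkaServer section are used by the broker to initiate connections to other brokers. In this example, admin is the user for inter-broker communication.
Pass the JAAS config file location as a JVM parameter to each Kafka broker: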
```
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
```

Configure the SASL port and SASL mechanisms in server.properties. For example:

listeners=SASL_SSL://host.name:port
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256 (or SCRAM-SHA-512)
sasl.enabled.mechanisms=SCRAM-SHA-256 (or SCRAM-SHA-512)

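3. Configuring Kafka clients

To configure SASL authentication on the clients:

Configure the JAAS configuration property for each client in producer.properties or consumer.properties. The login module describes how clients such as producers and consumers connect to the Kafka broker. The following is an example configuration for a client using a SCRAM mechanism: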
```
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="alice" \
password="alice-secret";
```
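The options username and password are used by clients to configure the user for client connections. In this example, clients connect to the broker as user alice. Different clients within a JVM may connect as different users by specifying different usernames and passwords in sasl.jaas.config.

The JAAS configuration for clients may alternatively be specified as a JVM parameter, similar to brokers. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.

Configure the following properties in producer.properties or consumer.properties: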

 security.protocol=SASL_SSL
 sasl.mechanism=SCRAM-SHA-256 (or SCRAM-SHA-512)

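4. Security considerations for SASL/SCRAM

The default implementation of SASL/SCRAM in Kafka stores SCRAM credentials in ZooKeeper. This is suitable for production use in installations where ZooKeeper is secure and on a private network.
Kafka supports only the strong hash functions SHA-256 and SHA-512, with a minimum iteration count of 4096. Strong hash functions combined with strong passwords and high iteration counts protect against brute force attacks if ZooKeeper security is compromised.
SCRAM should be used only with TLS encryption, to prevent interception of SCRAM exchanges. This protects against dictionary or brute force attacks, and against impersonation if ZooKeeper is compromised.
From Kafka version 2.0 onwards, in installations where ZooKeeper is not secure, the default SASL/SCRAM credential store may be overridden with custom callback handlers by configuring sasl.server.callback.handler.class.
For more details on security considerations, refer to RFC 5802.

Hands-on notes

Kafka SASL/SCRAM in practice

Enabling multiple SASL mechanisms in a Kafka broker

Enabling multiple SASL mechanisms in a broker

Specify configuration for the login modules of all enabled mechanisms in the KafkaServer section of the JAAS config file. For example: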

 KafkaServer {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   storeKey=true
   keyTab="/etc/security/keytabs/kafka_server.keytab"
   principal="kafka/kafka1.hostname.com@EXAMPLE.COM";

   org.apache.kafka.common.security.plain.PlainLoginModule required
   username="admin"
   password="admin-secret"
   user_admin="admin-secret"
   user_alice="alice-secret";
 };

Enable the SASL mechanisms in server.properties:

 sasl.enabled.mechanisms=GSSAPI,PLAIN,SCRAM-SHA-256,SCRAM-SHA-512

If required, specify the SASL security protocol and mechanism for inter-broker communication in server.properties:

 security.inter.broker.protocol=SASL_PLAINTEXT (or SASL_SSL)
 sasl.mechanism.inter.broker.protocol=GSSAPI (or one of the other enabled mechanisms)

Follow the mechanism-specific steps in GSSAPI (Kerberos), PLAIN, and SCRAM to configure SASL for the enabled mechanisms.
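Because the inter-broker mechanism has to be one of the enabled mechanisms, a pre-flight check before a rolling restart can catch typos. This is a hypothetical check, not a Kafka tool; it uses GSSAPI as the fallback for both properties, which matches their Kafka defaults.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

/** Hypothetical pre-flight check for the SASL mechanism settings in server.properties. */
final class SaslMechanismCheck {
    static Set<String> enabled(Properties serverProps) {
        String raw = serverProps.getProperty("sasl.enabled.mechanisms", "GSSAPI");
        Set<String> out = new LinkedHashSet<>();
        for (String m : raw.split(",")) {
            if (!m.trim().isEmpty()) out.add(m.trim());
        }
        return out;
    }

    /** Throws if the inter-broker mechanism is not among the enabled mechanisms. */
    static void validate(Properties serverProps) {
        String interBroker = serverProps.getProperty("sasl.mechanism.inter.broker.protocol", "GSSAPI");
        if (!enabled(serverProps).contains(interBroker)) {
            throw new IllegalStateException(interBroker + " is not listed in sasl.enabled.mechanisms");
        }
    }
}
```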

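Modifying SASL mechanisms in a running Kafka cluster

The SASL mechanism can be modified in a running cluster using the following sequence:

Enable the new SASL mechanism by adding it to sasl.enabled.mechanisms in server.properties on each broker. Update the JAAS config file to include both mechanisms. Incrementally bounce the cluster nodes.
Restart the clients using the new mechanism.
To change the mechanism of inter-broker communication (if this is required), set sasl.mechanism.inter.broker.protocol in server.properties to the new mechanism and incrementally bounce the cluster again.
To remove the old mechanism (if this is required), remove it from sasl.enabled.mechanisms in server.properties and remove the entries for the old mechanism from the JAAS config file. Incrementally bounce the cluster again.

Kafka authorization and ACLs

Kafka ships with a pluggable authorizer and an out-of-the-box authorizer implementation that uses ZooKeeper to store all the ACLs (access control lists). It is enabled by setting authorizer.class.name in server.properties: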

authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer

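Kafka ACLs are defined in the general format "Principal P is [Allowed/Denied] Operation O From Host H On Resource R". You can use the Kafka authorizer CLI to add, remove, or list ACLs. By default, if no ResourcePatterns match a specific resource R, then R has no associated ACLs, and therefore no one other than super users is allowed to access R. If you want to change that behavior, you can include the following in server.properties: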

allow.everyone.if.no.acl.found=true

You can also add super users in server.properties as follows (note that the delimiter is a semicolon, since SSL user names may contain commas):

super.users=User:Bob;User:Alice
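The rules above can be summarized as a small decision procedure. The sketch below is a simplified model, not Kafka's SimpleAclAuthorizer: super users always pass, a matching DENY wins over any ALLOW, a resource with no ACLs at all is refused unless allow.everyone.if.no.acl.found=true, and "*" matches any principal or host. It ignores operation implications (for example, Read implying Describe) and resource patterns.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the ACL rules described above (not the real authorizer). */
final class AclModel {
    enum Permission { ALLOW, DENY }

    static final class Acl {
        final String principal;
        final String host;
        final String operation;
        final Permission permission;

        Acl(String principal, String host, String operation, Permission permission) {
            this.principal = principal;
            this.host = host;
            this.operation = operation;
            this.permission = permission;
        }

        boolean matches(String p, String h, String op) {
            return (principal.equals("User:*") || principal.equals(p))
                    && (host.equals("*") || host.equals(h))
                    && (operation.equals("All") || operation.equals(op));
        }
    }

    /** Parses super.users, which is semicolon-separated. */
    static Set<String> parseSuperUsers(String value) {
        Set<String> out = new LinkedHashSet<>();
        for (String s : value.split(";")) {
            if (!s.trim().isEmpty()) out.add(s.trim());
        }
        return out;
    }

    static boolean authorize(List<Acl> aclsForResource, Set<String> superUsers,
                             boolean allowIfNoAcl, String principal, String host, String operation) {
        if (superUsers.contains(principal)) return true;
        if (aclsForResource.isEmpty()) return allowIfNoAcl;
        boolean allowed = false;
        for (Acl acl : aclsForResource) {
            if (!acl.matches(principal, host, operation)) continue;
            if (acl.permission == Permission.DENY) return false; // deny takes precedence
            allowed = true;
        }
        return allowed;
    }
}
```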

By default, the SSL user name is of the form "CN=writeuser,OU=Unknown,O=Unknown,L=Unknown,ST=Unknown,C=Unknown". This can be changed by setting a customized PrincipalBuilder in server.properties, as follows:

principal.builder.class=CustomizedPrincipalBuilderClass

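The SASL user name can be customized by setting sasl.kerberos.principal.to.local.rules in server.properties. The format of sasl.kerberos.principal.to.local.rules is a list in which each rule works in the same way as auth_to_local in the Kerberos configuration file (krb5.conf). An additional lowercase rule is also supported, which forces the translated result to lower case; this is done by adding "/L" to the end of the rule. Each rule starts with RULE: and contains an expression in one of the formats below. See the Kerberos documentation for more details.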

RULE:[n:string](regexp)s/pattern/replacement/
RULE:[n:string](regexp)s/pattern/replacement/g
RULE:[n:string](regexp)s/pattern/replacement//L
RULE:[n:string](regexp)s/pattern/replacement/g/L

For example, to add a rule that translates user@MYDOMAIN.COM to user while keeping the default rule in place:

sasl.kerberos.principal.to.local.rules=RULE:[1:$1@$0](.*@MYDOMAIN.COM)s/@.*//,DEFAULT
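To see what the example rule does step by step, the sketch below hand-codes only RULE:[1:$1@$0](.*@MYDOMAIN.COM)s/@.*//. It is not a general auth_to_local parser, and it assumes, as in the usual implementation, that the filter must match the whole formatted string. In the format, $0 is the realm and $1 the first component of the principal name.

```java
import java.util.regex.Pattern;

/** Hand-coded evaluation of the single example rule above, to show its stages. */
final class ToLocalRuleExample {
    private static final Pattern FILTER = Pattern.compile(".*@MYDOMAIN.COM");

    /** Returns the short name, or null if this rule does not apply (DEFAULT would be tried next). */
    static String apply(String principal) {
        int at = principal.lastIndexOf('@');
        if (at < 0) return null;
        String name = principal.substring(0, at);
        String realm = principal.substring(at + 1);
        String[] components = name.split("/");
        // [1:...] : the rule only applies to principals with exactly one component
        if (components.length != 1) return null;
        // "$1@$0" : first component, '@', realm
        String formatted = components[0] + "@" + realm;
        // (regexp) : the formatted string must match the filter
        if (!FILTER.matcher(formatted).matches()) return null;
        // s/@.*// : replace the first match of "@.*" with the empty string
        return formatted.replaceFirst("@.*", "");
    }
}
```

Principals such as kafka/host@MYDOMAIN.COM have two components, so this rule skips them and the DEFAULT rule applies.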

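Command Line Interface

The Kafka authorization management CLI can be found in the bin directory with all the other CLIs. The CLI script is called kafka-acls.sh. The following lists all the options that the script supports:

Option	Description	Default	Option type
--add	Indicates to the script that the user is trying to add an ACL.		Action
--remove	Indicates to the script that the user is trying to remove an ACL.		Action
--list	Indicates to the script that the user is trying to list ACLs.		Action
--authorizer	Fully qualified class name of the authorizer.	kafka.security.auth.SimpleAclAuthorizer	Configuration
--authorizer-properties	key=val pairs passed to the authorizer for initialization, e.g. zookeeper.connect=localhost:2181.		Configuration
--cluster	Specifies the cluster as the resource.		Resource
--topic [topic-name]	Specifies the topic as the resource.		Resource
--group [group-name]	Specifies the consumer group as the resource.		Resource
--allow-principal	Principal in PrincipalType:name format to be added to an ACL with Allow permission. You can specify multiple.		Principal
--deny-principal	Principal in PrincipalType:name format to be added to an ACL with Deny permission. You can specify multiple.		Principal
--allow-host	IP address from which the principals listed in --allow-principal will have access.	If --allow-principal is specified, defaults to *, which means "all hosts"	Host
--deny-host	IP address from which the principals listed in --deny-principal will be denied access.	If --deny-principal is specified, defaults to *, which means "all hosts"	Host
--operation	Operation that will be allowed or denied. Valid values are: Read, Write, Create, Delete, Alter, Describe, ClusterAction, All.	All	Operation
--producer	Convenience option to add/remove ACLs for the producer role. Generates ACLs that allow WRITE and DESCRIBE on the topic and CREATE on the cluster.		Convenience
--consumer	Convenience option to add/remove ACLs for the consumer role. Generates ACLs that allow READ and DESCRIBE on the topic and READ on the consumer group.		Convenience
--force	Convenience option to assume yes to all prompts and not ask.		Convenience

Examples

Adding ACLs
Suppose you want to add an ACL "Principals User:Bob and User:Alice are allowed to perform Operation Read and Write on Topic Test-topic from IP 198.51.100.0 and IP 198.51.100.1". You can do that by executing the CLI with the following options: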
```
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --allow-principal User:Alice --allow-host 198.51.100.0 --allow-host 198.51.100.1 --operation Read --operation Write --topic Test-topic
```
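By default, all principals that don't have an explicit ACL allowing access for an operation to a resource are denied. In rare cases where an allow ACL is defined that allows access to all but some principal, we will have to use the --deny-principal and --deny-host options. For example, if we want to allow all users to Read from Test-topic but deny only User:BadBob from IP 198.51.100.3, we can do so with the following command (the * characters are quoted so that the shell does not expand them):
```
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal 'User:*' --allow-host '*' --deny-principal User:BadBob --deny-host 198.51.100.3 --operation Read --topic Test-topic
```
Note that --allow-host and --deny-host only support IP addresses (host names are not supported). The above examples add ACLs to a topic by specifying --topic [topic-name] as the resource option. Similarly, you can add ACLs to the cluster by specifying --cluster, and to a consumer group by specifying --group [group-name].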

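Removing ACLs
Removing ACLs works the same way, except that the --remove option is used instead of --add. To remove the ACLs added by the first example above, run the CLI with the following options: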

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --remove --allow-principal User:Bob --allow-principal User:Alice --allow-host 198.51.100.0 --allow-host 198.51.100.1 --operation Read --operation Write --topic Test-topic

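Listing ACLs
We can list the ACLs for any resource by specifying the --list option with the resource. To list all ACLs for Test-topic, run the CLI with the following options: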
```
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --list --topic Test-topic
```
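Adding or removing a principal as producer or consumer
The most common use case for ACL management is adding or removing a principal as a producer or consumer, so convenience options were added to handle these cases. To add User:Bob as a producer of Test-topic, run the following command: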
```
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --producer --topic Test-topic
```
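Similarly, to add Alice as a consumer of Test-topic with consumer group Group-1, we just pass the --consumer option:
```
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Alice --consumer --topic Test-topic --group Group-1
```
Note that for the --consumer option we must also specify the consumer group. To remove a principal from the producer or consumer role, we just need to pass the --remove option.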

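Incorporating security features in a running Kafka cluster

You can secure a running cluster via one or more of the supported protocols discussed previously. This is done in phases:

Incrementally bounce the cluster nodes to open additional secured port(s).
Restart clients using the secured rather than the PLAINTEXT port (assuming you are securing the client-broker connection).
Incrementally bounce the cluster again to enable broker-to-broker security (if this is required).
A final incremental bounce to close the PLAINTEXT port.

The specific steps for configuring SSL and SASL are described in sections 7.2 and 7.3. Follow these steps to enable security for your desired protocol(s).

The security implementation lets you configure different protocols for broker-client and broker-broker communication. These must be enabled in separate bounces. A PLAINTEXT port must be left open throughout, so that brokers and/or clients can continue to communicate.

When performing an incremental bounce, stop the brokers cleanly via SIGTERM. It is also good practice to wait for restarted replicas to return to the ISR list before moving on to the next node.

As an example, say we wish to encrypt both broker-client and broker-broker communication with SSL. In the first incremental bounce, an SSL port is opened on each node: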

listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092

We then restart the clients, changing their config to point at the newly opened, secured port:

bootstrap.servers = [broker1:9092,...]
security.protocol = SSL
...etc

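In the second incremental server bounce, we instruct Kafka to use SSL as the broker-broker protocol (which will use the same SSL port):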

listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092
security.inter.broker.protocol=SSL

In the final bounce, we secure the cluster by closing the PLAINTEXT port:

listeners=SSL://broker1:9092
security.inter.broker.protocol=SSL

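Alternatively, we might choose to open multiple ports so that different protocols can be used for broker-broker and broker-client communication. Say we wish to use SSL encryption throughout (i.e., for both broker-broker and broker-client communication), but we would also like to add SASL authentication to the broker-client connection. We would achieve this by opening two additional ports during the first bounce: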

listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092,SASL_SSL://broker1:9093

We would then restart the clients, changing their config to point at the newly opened SASL & SSL secured port:

bootstrap.servers = [broker1:9093,...]
security.protocol = SASL_SSL
...etc

The second server bounce switches the cluster to encrypted broker-broker communication via the SSL port opened earlier on port 9092:

listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092,SASL_SSL://broker1:9093
security.inter.broker.protocol=SSL

The final bounce secures the cluster by closing the PLAINTEXT port:

listeners=SSL://broker1:9092,SASL_SSL://broker1:9093
security.inter.broker.protocol=SSL

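ZooKeeper can be secured independently of the Kafka cluster.

ZooKeeper authentication for Kafka

7.6.1 New clusters

To enable ZooKeeper authentication on brokers, there are two necessary steps:

Create a JAAS login file and set the appropriate system property to point to it, as described above.
Set the configuration property zookeeper.set.acl to true in each broker.

The metadata stored in ZooKeeper for the Kafka cluster is world-readable, but can only be modified by the brokers. The rationale is that the data stored in ZooKeeper is not sensitive, but inappropriate manipulation of znodes can cause cluster disruption. We also recommend limiting access to ZooKeeper via network segmentation (only brokers and some admin tools need access to ZooKeeper).

7.6.2 Migrating clusters

If you are running a version of Kafka that does not support security, or simply have security disabled, and you want to make the cluster secure, execute the following steps to enable ZooKeeper authentication with minimal disruption to your operations:

Perform a rolling restart setting the JAAS login file, which enables brokers to authenticate. At the end of the rolling restart, brokers are able to manipulate znodes with strict ACLs, but they will not create znodes with those ACLs.
Perform a second rolling restart of brokers, this time setting the configuration parameter zookeeper.set.acl to true, which enables the use of secure ACLs when creating znodes.
Execute the ZkSecurityMigrator tool by running the script ./bin/zookeeper-security-migration.sh with zookeeper.acl set to secure. This tool traverses the corresponding sub-trees, changing the ACLs of the znodes.

It is also possible to turn off authentication in a secure cluster. To do so, follow these steps:

Perform a rolling restart of brokers setting the JAAS login file, which enables brokers to authenticate, but setting zookeeper.set.acl to false. At the end of the rolling restart, brokers stop creating znodes with secure ACLs, but are still able to authenticate and manipulate all znodes.
Execute the ZkSecurityMigrator tool by running the script ./bin/zookeeper-security-migration.sh with zookeeper.acl set to unsecure. This tool traverses the corresponding sub-trees, changing the ACLs of the znodes.
Perform a second rolling restart of brokers, this time omitting the system property that sets the JAAS login file.

Here is an example of how to run the migration tool:

./bin/zookeeper-security-migration.sh --zookeeper.acl=secure --zookeeper.connect=localhost:2181

Run this to see the full list of parameters:

./bin/zookeeper-security-migration.sh --help

7.6.3 Migrating the ZooKeeper ensemble

It is also necessary to enable authentication on the ZooKeeper ensemble. To do so, a few properties need to be set; please refer to the ZooKeeper documentation for more detail: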

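Kafka Connect user guide

8.1 Overview

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect messages from all your application servers into Kafka topics, making the data available for low-latency stream processing. An export job can deliver data from Kafka topics into secondary storage and query systems, or into batch systems for offline analysis. Kafka Connect features include:

A common framework for Kafka connectors - Kafka Connect standardizes the integration of other data systems with Kafka, simplifying connector development, deployment, and management.
Distributed and standalone modes - scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing, and small production deployments.
REST interface - submit and manage connectors on your Kafka Connect cluster via a REST API.
Automatic offset management - with just a little information from connectors, Kafka Connect manages the offset commit process automatically, so connector developers do not need to worry about this error-prone part of connector development.
Distributed and scalable by default - Kafka Connect builds on the existing group management protocol. More workers can be added to scale up a Kafka Connect cluster.
Streaming/batch integration - leveraging Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems.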

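8.2 User Guide

The quickstart provides a brief example of how to run a standalone version of Kafka Connect. This section describes how to configure, run, and manage Kafka Connect in more detail.

Running Kafka Connect

Kafka Connect currently supports two modes of execution: standalone (single process) and distributed. In standalone mode, all work is performed in a single process. This configuration is simpler to set up and get started with, and may be useful in situations where only one worker makes sense (e.g. collecting log files), but it does not benefit from some of the features of Kafka Connect, such as fault tolerance. You can start a standalone process with the following command: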

> bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]

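The first parameter is the configuration for the worker. This includes settings such as the Kafka connection parameters, the serialization format, and how frequently to commit offsets. The provided example should work well with a local cluster running with the default configuration provided by config/server.properties. The remaining parameters are connector configuration files. You may include as many as you want, but all will execute within the same process (on different threads). Distributed mode handles automatic balancing of work, allows you to scale up (or down) dynamically, and offers fault tolerance both in the active tasks and for configuration and offset commit data. Execution is very similar to standalone mode: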

bin/connect-distributed.sh config/connect-distributed.properties

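The difference is in the class that is started and in the configuration parameters, which change how the Kafka Connect process decides where to store configurations, how to assign work, and where to store offsets and task statuses. In distributed mode, Kafka Connect stores the offsets, configs, and task statuses in Kafka topics. It is recommended to create the topics for offsets, configs, and statuses manually, in order to achieve the desired number of partitions and replication factor. If the topics have not been created yet when Kafka Connect starts, they will be auto-created with the default number of partitions and replication factor, which may not be best suited for their usage (Kafka cannot know your requirements and can only fall back to the defaults). In particular, the following configuration parameters are critical to set before starting your cluster:

group.id (default connect-cluster) - unique name for the cluster, used in forming the Connect cluster group; note that this must not conflict with consumer group IDs.
config.storage.topic (default connect-configs) - topic to use for storing connector and task configurations; note that this should be a single-partition, highly replicated topic. You may need to create this topic manually to ensure it has a single partition (auto-created topics may have multiple partitions).
offset.storage.topic (default connect-offsets) - topic to use for storing offsets; this topic should have many partitions and be replicated.
status.storage.topic (default connect-status) - topic to use for storing statuses; this topic can have multiple partitions and should be replicated.

Note that in distributed mode, connector configurations are not passed on the command line. Instead, use the REST API described below to create, modify, and destroy connectors.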

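Configuring connectors

Connector configurations are simple key-value mappings. In standalone mode, they are defined in a properties file and passed to the Connect process on the command line. In distributed mode, they are included in the JSON payload of the request that creates (or modifies) the connector. Most configurations are connector dependent, but there are a few common options:

name - unique name for the connector; attempting to register again with the same name will fail.
connector.class - the Java class for the connector.
tasks.max - the maximum number of tasks that should be created for this connector.
The connector.class config supports several formats: the full name or an alias of the class for this connector. If the connector is org.apache.kafka.connect.file.FileStreamSinkConnector, you can either specify this full name or use FileStreamSink or FileStreamSinkConnector to make the configuration a bit shorter. Sink connectors also have one additional option to control their input:
topics - a list of topics to use as input for this connector.

For any other options, consult the documentation for the connector.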
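The three accepted spellings of connector.class can be expressed as a simple predicate. This hypothetical check only illustrates the alias forms described above; Kafka Connect's real resolution happens against the connector plugins installed on the worker.

```java
/** Hypothetical check of the alias forms accepted for connector.class. */
final class ConnectorAlias {
    static boolean matches(String configured, String fullClassName) {
        String simple = fullClassName.substring(fullClassName.lastIndexOf('.') + 1);
        String shortAlias = simple.endsWith("Connector")
                ? simple.substring(0, simple.length() - "Connector".length())
                : simple;
        return configured.equals(fullClassName)   // fully qualified name
                || configured.equals(simple)      // simple class name
                || configured.equals(shortAlias); // simple name without "Connector"
    }
}
```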

REST API

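Since Kafka Connect is intended to be run as a service, it also provides a REST API for managing connectors. By default, this service runs on port 8083. The following endpoints are currently supported:

GET /connectors - return a list of active connectors
POST /connectors - create a new connector; the request body should be a JSON object containing a string name field and an object config field with the connector configuration parameters
GET /connectors/{name} - get information about a specific connector
GET /connectors/{name}/config - get the configuration parameters for a specific connector
PUT /connectors/{name}/config - update the configuration parameters for a specific connector
GET /connectors/{name}/status - get the current status of the connector, including whether it is running, failed, paused, etc.
GET /connectors/{name}/tasks - get a list of tasks currently running for a connector
GET /connectors/{name}/tasks/{taskid}/status - get the current status of the task, including whether it is running, failed, paused, etc.
PUT /connectors/{name}/pause - pause the connector and its tasks, which stops message processing until the connector is resumed
PUT /connectors/{name}/resume - resume a paused connector (or do nothing if the connector is not paused)
POST /connectors/{name}/restart - restart a connector (typically because it has failed)
POST /connectors/{name}/tasks/{taskId}/restart - restart an individual task (typically because it has failed)
DELETE /connectors/{name} - delete a connector, halting all tasks and deleting its configuration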
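As an example of the POST /connectors call, the sketch below uses the JDK 11+ java.net.http client. It assumes a Connect worker on localhost:8083 and uses the FileStreamSource example connector with the file and topic settings from the quickstart; the connector name, file and topic are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Creates a connector via the Connect REST API (JDK 11+ HttpClient). */
final class CreateConnector {
    // {"name": ..., "config": {...}} as described for POST /connectors
    static final String BODY = "{\"name\":\"local-file-source\",\"config\":{"
            + "\"connector.class\":\"FileStreamSource\",\"tasks.max\":\"1\","
            + "\"file\":\"test.txt\",\"topic\":\"connect-test\"}}";

    static HttpRequest createRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/connectors"))
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                createRequest("http://localhost:8083", BODY),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The other endpoints follow the same pattern; for example, GET /connectors/local-file-source/status would report whether the new connector and its task are running.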

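Kafka Connect also provides a REST API for getting information about connector plugins:

GET /connector-plugins - return a list of connector plugins installed in the Kafka Connect cluster. Note that the API only checks for connectors on the worker that handles the request, which means you may see inconsistent results, especially during a rolling upgrade if you add new connector jars.
PUT /connector-plugins/{connector-type}/config/validate - validate the provided configuration values against the configuration definition. This API performs per-config validation and returns suggested values and error messages.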

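Kafka Connect developer guide

8.3 Connector Development Guide

This guide describes how developers can write new connectors for Kafka Connect to move data between Kafka and other systems. It briefly reviews a few key concepts and then describes how to create a simple connector.

Core Concepts and APIs

To copy data between Kafka and another system, users create a Connector for the system they want to pull data from or push data to. Connectors come in two flavors: SourceConnectors import data from another system (e.g. JDBCSourceConnector would import a relational database into Kafka), and SinkConnectors export data (e.g. HDFSSinkConnector would export the contents of a Kafka topic to an HDFS file). Connectors do not perform any data copying themselves: their configuration describes the data to be copied, and the Connector is responsible for breaking that job into a set of Tasks that can be distributed to workers. These Tasks also come in two corresponding flavors: SourceTask and SinkTask. With an assignment in hand, each Task must copy its subset of the data to or from Kafka. In Kafka Connect, it should always be possible to frame these assignments as a set of input and output streams consisting of records with consistent schemas. Sometimes this mapping is obvious: each file in a set of log files can be considered a stream, with each parsed line forming a record using the same schema, and offsets stored as byte offsets in the file. In other cases it may take more effort to map to this model: a JDBC connector can map each table to a stream, but the offset is less clear. One possible mapping uses a timestamp column to generate queries that incrementally return new data, with the last queried timestamp used as the offset.

Streams and Records

Each stream should be a sequence of key-value records. Both the keys and values can have complex structure: many primitive types are provided, but arrays, objects, and nested data structures can be represented as well. The runtime data format does not assume any particular serialization format; this conversion is handled internally by the framework. In addition to the key and value, records (both those generated by sources and those delivered to sinks) have associated stream IDs and offsets. The framework uses these to periodically commit the offsets of data that has been processed, so that in the event of a failure, processing can resume from the last committed offsets, avoiding unnecessary reprocessing and duplication of events.

Dynamic Connectors

Not all jobs are static, so Connector implementations are also responsible for monitoring the external system for any changes that might require reconfiguration. For example, in the JDBCSourceConnector example, the Connector might assign a set of tables to each Task. When a new table is created, it must discover this so that it can assign the new table to one of the Tasks by updating its configuration. When it notices a change that requires reconfiguration (or a change in the number of Tasks), it notifies the framework, and the framework updates the corresponding Tasks.

Developing a Simple Connector

Developing a connector only requires implementing two interfaces, the Connector and the Task. A simple example is included with the source code for Kafka in the file package. This connector is meant for use in standalone mode and has implementations of a SourceConnector/SourceTask that read each line of a file and emit it as a record, and a SinkConnector/SinkTask that write each record to a file. The rest of this section walks through some code to demonstrate the key steps in creating a connector, but developers should also refer to the full example source code, as many details are omitted for brevity.

Connector Example

We'll cover the SourceConnector as a simple example; SinkConnector implementations are very similar. Start by creating a class that inherits from SourceConnector, and add a couple of fields to store the parsed configuration information (the filename to read from and the topic to send data to):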

public class FileStreamSourceConnector extends SourceConnector {
    private String filename;
    private String topic;

The simplest method to implement is taskClass(), which defines the class that should be instantiated in worker processes to actually read the data:

@Override
public Class<? extends Task> taskClass() {
    return FileStreamSourceTask.class;
}

We will define the FileStreamSourceTask class below. Next, we add some standard lifecycle methods, start() and stop():

@Override
public void start(Map<String, String> props) {
    // The complete version includes error handling as well.
    filename = props.get(FILE_CONFIG);
    topic = props.get(TOPIC_CONFIG);
}

@Override
public void stop() {
    // Nothing to do since no background monitoring is required.
}

Finally, the real core of the implementation is in taskConfigs(). In this case we are only handling a single file. Even though we may be permitted to generate more tasks (as per the maxTasks argument), we return a list with only one entry:

@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
    ArrayList<Map<String, String>> configs = new ArrayList<>();
    // Only one input stream makes sense.
    Map<String, String> config = new HashMap<>();
    if (filename != null)
        config.put(FILE_CONFIG, filename);
    config.put(TOPIC_CONFIG, topic);
    configs.add(config);
    return configs;
}

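Although not used in this example, SourceTask also provides two APIs to commit offsets in the source system: commit and commitRecord. These APIs are provided for source systems that have an acknowledgement mechanism for messages. Overriding these methods allows the source connector to acknowledge messages in the source system, either in bulk or individually, once they have been written to Kafka.

The commit API stores the offsets in the source system, up to the offsets that have been returned by poll. The implementation of this API should block until the commit is complete. The commitRecord API saves the offset in the source system for each SourceRecord after it is written to Kafka. Because Kafka Connect records offsets automatically, SourceTasks are not required to implement either method. In cases where a connector does need to acknowledge messages in the source system, typically only one of the two APIs is required.

Even with multiple tasks, the taskConfigs() implementation is usually quite simple. It just has to determine the number of inputs, which may require contacting the remote service it is pulling data from, and then divide them up among the tasks. Because some patterns for splitting work among tasks are so common, utilities are provided in ConnectorUtils to simplify these cases. Note that this simple example does not include dynamic input; see the discussion in the next section for how to trigger updates to task configs.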
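The "determine the inputs, then divide them up" step can be sketched in plain Java. This sketch does not call ConnectorUtils. It is a stand-in that splits a list of inputs into at most `maxTasks` contiguous groups and turns each group into one task config. The `tables` config key is a hypothetical connector-specific option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the kind of work ConnectorUtils helps with (not its real code):
// split N inputs into at most maxTasks contiguous, nearly equal groups.
final class TaskSplitSketch {

    static <T> List<List<T>> group(List<T> inputs, int maxTasks) {
        List<List<T>> groups = new ArrayList<>();
        int numGroups = Math.min(inputs.size(), maxTasks);
        if (numGroups <= 0) return groups;          // nothing to assign
        int base = inputs.size() / numGroups;        // minimum group size
        int extra = inputs.size() % numGroups;       // first `extra` groups get one more
        int start = 0;
        for (int g = 0; g < numGroups; g++) {
            int size = base + (g < extra ? 1 : 0);
            groups.add(new ArrayList<>(inputs.subList(start, start + size)));
            start += size;
        }
        return groups;
    }

    // One config map per task; "tables" is a hypothetical connector-specific key.
    static List<Map<String, String>> taskConfigs(List<String> tables, int maxTasks) {
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> group : group(tables, maxTasks)) {
            Map<String, String> config = new HashMap<>();
            config.put("tables", String.join(",", group));
            configs.add(config);
        }
        return configs;
    }
}
```

Creating fewer tasks than `maxTasks` when there are fewer inputs mirrors the FileStream example above, which returns a single config regardless of `maxTasks`.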

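Task Example - Source Task

Next we'll describe the implementation of the corresponding SourceTask. We'll use pseudo-code to describe most of the implementation; you can refer to the source code for the full example. Just as with the connector, we need to create a class inheriting from the appropriate base Task class. It also has some standard lifecycle methods: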

public class FileStreamSourceTask extends SourceTask {
    String filename;
    InputStream stream;
    String topic;

    @Override
    public void start(Map<String, String> props) {
        filename = props.get(FileStreamSourceConnector.FILE_CONFIG);
        stream = openOrThrowError(filename);  // pseudo-code: open the file or fail the task
        topic = props.get(FileStreamSourceConnector.TOPIC_CONFIG);
    }

    @Override
    public synchronized void stop() {
        try {
            stream.close();
        } catch (IOException e) {
            // Ignore: the task is shutting down anyway.
        }
    }

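These are slightly simplified versions, but they show that these methods should be relatively simple: the only work they should perform is allocating or freeing resources. There are two points to note about this implementation:

First, the start() method does not yet handle resuming from a previous offset; this will be addressed in a later section.
Second, the stop() method is synchronized. This is necessary because SourceTasks are given a dedicated thread that they can block indefinitely, so they need to be stopped with a call from a different thread in the Worker.

Next, we implement the main functionality of the task: the poll() method, which gets events from the input system and returns a List<SourceRecord>: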

@Override
public List<SourceRecord> poll() throws InterruptedException {
    try {
        ArrayList<SourceRecord> records = new ArrayList<>();
        while (streamValid(stream) && records.isEmpty()) {
            // Pseudo-code helper: LineAndOffset holds the line text (line) and its byte offset (offset).
            LineAndOffset line = readToNextLine(stream);
            if (line != null) {
                Map<String, Object> sourcePartition = Collections.singletonMap("filename", filename);
                Map<String, Object> sourceOffset = Collections.singletonMap("position", line.offset);
                records.add(new SourceRecord(sourcePartition, sourceOffset, topic, Schema.STRING_SCHEMA, line.line));
            } else {
                Thread.sleep(1);
            }
        }
        return records;
    } catch (IOException e) {
        // Underlying stream was killed, probably as a result of calling stop. Allow to return
        // null, and driving thread will handle any shutdown if necessary.
    }
    return null;
}

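Again, we've omitted some details, but we can see the important steps. The poll() method is called repeatedly, and each call loops trying to read records from the file. For each line it reads, it also tracks the file offset. It uses this information to create an output SourceRecord with four pieces of information:

the source partition (there is only one: the single file being read)
the source offset (the byte offset in the file)
the output topic name
the output value (the line, with a schema indicating that this value will always be a string)

Other variants of the SourceRecord constructor can also include a specific output partition and a key.

Note that this implementation uses the normal Java InputStream interface and may sleep if data is not available. This is acceptable because Kafka Connect provides each task with a dedicated thread. While task implementations have to conform to the basic poll() interface, they have a lot of flexibility in how they are implemented. In this case, an NIO-based implementation would be more efficient, but this simple approach works, is quick to implement, and is compatible with older versions of Java.

Sink Tasks

The previous section described how to implement a simple SourceTask. Unlike SourceConnector and SinkConnector, SourceTask and SinkTask have very different interfaces, because SourceTask uses a pull interface and SinkTask uses a push interface. Both share the common lifecycle methods, but the SinkTask interface is quite different: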

public abstract class SinkTask implements Task {
    protected SinkTaskContext context;

    public void initialize(SinkTaskContext context) {
        this.context = context;
    }

    public abstract void put(Collection<SinkRecord> records);

    public abstract void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets);

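The SinkTask documentation contains full details, but this interface is nearly as simple as SourceTask.

The put() method should contain most of the implementation. It accepts sets of SinkRecords, performs any required translation, and stores them in the destination system. This method does not need to ensure the data has been fully written to the destination system before returning. In fact, in many cases internal buffering is useful, because an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store. SinkRecords contain essentially the same information as SourceRecords: Kafka topic, partition, offset, and the event key and value.

The flush() method is used during the offset commit process. This process allows tasks to recover from failures and resume from a safe point such that no events will be missed. The method should push any outstanding data to the destination system and then block until the write has been acknowledged. The offsets parameter can often be ignored. It is useful in some cases where implementations want to store offset information in the destination store to provide exactly-once delivery. For example, an HDFS connector could do this, using atomic move operations to make sure the flush() operation atomically commits the data and offsets to a final location in HDFS.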
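The following plain-Java model illustrates the put()/flush() contract: put() only buffers records, and flush() writes the whole batch before recording the offsets as durable. It deliberately avoids the Connect API; the record type, the in-memory destination, and the `committed` map are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the put()/flush() contract (not the Kafka Connect API).
final class BufferedSinkSketch {
    static final class Rec {
        final int partition; final long offset; final String value;
        Rec(int partition, long offset, String value) {
            this.partition = partition; this.offset = offset; this.value = value;
        }
    }

    private final List<Rec> buffer = new ArrayList<>();
    final List<String> destination = new ArrayList<>();    // stands in for the target system
    final Map<Integer, Long> committed = new HashMap<>();  // offsets known to be durable

    // put(): may just buffer; nothing has to reach the destination yet.
    void put(Collection<Rec> records) { buffer.addAll(records); }

    // flush(): write everything outstanding as one batch, and only afterwards
    // treat the given offsets as safe to commit.
    void flush(Map<Integer, Long> offsets) {
        for (Rec r : buffer) destination.add(r.value);
        buffer.clear();
        committed.putAll(offsets);
    }

    int buffered() { return buffer.size(); }
}
```

The ordering inside flush() (write, then record offsets) is what keeps a crash between the two steps from losing data: at worst, records are written twice.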

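Resuming from Previous Offsets

The SourceTask implementation includes a stream ID (the input filename) and an offset (the position in the file) with each record. The framework uses these to commit offsets periodically. In the case of a failure, the task can then recover while minimizing the number of events that are reprocessed and possibly duplicated. If Kafka Connect was stopped gracefully (for example, in standalone mode or due to a job reconfiguration), the task can resume from the most recent offset.

This commit process is completely automated by the framework, but only the connector knows how to seek back to the right position in the input stream to resume from that location. To resume correctly on startup, the task can use the SourceContext passed into its initialize() method to access the offset data. In initialize(), we would add a bit more code to read the offset (if it exists) and seek to that position: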

    stream = new FileInputStream(filename);
    // FILENAME_FIELD must match the partition key used in poll() ("filename").
    Map<String, Object> offset = context.offsetStorageReader().offset(Collections.singletonMap(FILENAME_FIELD, filename));
    if (offset != null) {
        Long lastRecordedOffset = (Long) offset.get("position");
        if (lastRecordedOffset != null)
            seekToOffset(stream, lastRecordedOffset);
    }

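Of course, you might need to read many keys for each of the input streams. The OffsetStorageReader interface also allows you to issue bulk reads to efficiently load all offsets, and then apply them by seeking each input stream to the appropriate position.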
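Here is a minimal sketch of the lookup performed on startup, with a `HashMap` standing in for the framework-provided OffsetStorageReader. Partitions and offsets use the same `filename` and `position` keys as the poll() example. The class and its methods are hypothetical; they only mirror the single-read and bulk-read styles described above.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the startup offset lookup; a HashMap plays the role of
// the framework's offset storage. Partitions are {"filename": ...} and
// offsets are {"position": byteOffset}.
final class OffsetResumeSketch {
    private final Map<Map<String, Object>, Map<String, Object>> store = new HashMap<>();

    // What the framework would have committed on the task's behalf.
    void commit(String filename, long position) {
        store.put(Collections.singletonMap("filename", filename),
                  Collections.singletonMap("position", position));
    }

    // Single lookup, mirroring offsetStorageReader().offset(partition).
    Map<String, Object> offset(Map<String, Object> partition) {
        return store.get(partition);
    }

    // Bulk lookup: one call for many partitions; null means "never committed".
    Map<Map<String, Object>, Map<String, Object>> offsets(Collection<Map<String, Object>> partitions) {
        Map<Map<String, Object>, Map<String, Object>> result = new HashMap<>();
        for (Map<String, Object> p : partitions) result.put(p, store.get(p));
        return result;
    }

    // Where to seek before reading: the committed position, or 0 for a new file.
    long resumePosition(String filename) {
        Map<String, Object> offset = offset(Collections.singletonMap("filename", filename));
        if (offset == null) return 0L;
        Object pos = offset.get("position");
        return pos instanceof Long ? (Long) pos : 0L;
    }
}
```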

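Dynamic Input/Output Streams

Kafka Connect is intended to define bulk data copying jobs, such as copying an entire database, rather than creating many jobs to copy each table individually. One consequence of this design is that the set of input or output streams for a connector can vary over time. Source connectors need to monitor the source system for changes, such as table additions or deletions in a database. When they pick up changes, they should notify the framework via the ConnectorContext object that reconfiguration is necessary. For example, in a SourceConnector: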

    if (inputsChanged())
        this.context.requestTaskReconfiguration();

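The framework will promptly request new configuration information and update the tasks, allowing them to gracefully commit their progress before being reconfigured. Note that in a SourceConnector this monitoring is currently left to the connector implementation. If an extra thread is required to perform the monitoring, the connector must allocate it itself.

Ideally, the code for monitoring changes would be isolated to the Connector, and tasks would not need to worry about it. However, changes can also affect tasks, most commonly when one of their input streams is destroyed in the input system, for example when a table is dropped from a database. If the Task encounters the problem before the Connector does, the Task will need to handle the subsequent error. This will be common if the Connector needs to poll for changes. Thankfully, this can usually be handled simply by catching and handling the appropriate exception.

SinkConnectors usually only have to handle the addition of streams, which may translate to new entries in their outputs (for example, a new database table). The framework manages any changes to the Kafka input, such as when the set of input topics changes because of a regex subscription. SinkTasks should expect new input streams, which may require creating new resources in the downstream system, such as a new table in a database. The trickiest situation to handle is a conflict between multiple SinkTasks that see a new input stream for the first time and simultaneously try to create the new resource. SinkConnectors, on the other hand, generally require no special code for handling a dynamic set of streams.

Connect Configuration Validation

Kafka Connect allows you to validate connector configurations before submitting a connector to be executed, and can provide feedback about errors and recommended values. To take advantage of this, connector developers need to provide an implementation of config() that exposes the configuration definition to the framework. The following code in FileStreamSourceConnector defines the configuration and exposes it to the framework: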

private static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define(FILE_CONFIG, Type.STRING, Importance.HIGH, "Source filename.")
    .define(TOPIC_CONFIG, Type.STRING, Importance.HIGH, "The topic to publish data to");

@Override
public ConfigDef config() {
    return CONFIG_DEF;
}

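The ConfigDef class is used to specify the set of expected configurations. For each configuration, you can specify the name, the type, the default value, the documentation, the group information, the order within the group, the width of the configuration value, and a name suitable for display in the UI. You can also provide special validation logic for a single configuration by overriding the Validator class.

Moreover, there may be dependencies between configurations: the valid values and visibility of a configuration may change according to the values of other configurations. To handle this, ConfigDef allows you to specify the dependents of a configuration. It also lets you provide an implementation of Recommender that, given the current configuration values, gets the valid values and sets the visibility of a configuration.

Also, the validate() method in Connector provides a default validation implementation. It returns a list of allowed configurations, together with configuration errors and recommended values for each configuration. However, it does not use the recommended values for configuration validation. You can override the default implementation for customized configuration validation, which may use the recommended values.

Working with Schemas

The FileStream connectors are good examples because they are simple, but they also have trivially structured data: each line is just a string. Almost all practical connectors need schemas with more complex data formats. To create more complex data, you'll need to work with the Kafka Connect data API. Besides primitive types, most structured records need to interact with two classes: Schema and Struct. The API documentation provides a complete reference; here is a simple example that creates a Schema and a Struct: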

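import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

// NAME is a placeholder for the schema's name.
Schema schema = SchemaBuilder.struct().name(NAME)
    .field("name", Schema.STRING_SCHEMA)
    .field("age", Schema.INT32_SCHEMA)
    .field("admin", SchemaBuilder.bool().defaultValue(false).build())
    .build();

// Struct.put() returns the Struct, so calls can be chained; no build() step is needed.
Struct struct = new Struct(schema)
    .put("name", "Barbara Liskov")
    .put("age", 75);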

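If you are implementing a source connector, you need to decide when and how to create schemas. Where possible, avoid recomputing them. For example, if your connector is guaranteed to have a fixed schema, create it statically and reuse a single instance.

However, many connectors have dynamic schemas. A simple example is a database connector. Even considering just a single table, the schema cannot be predefined for the entire connector, because it varies from table to table. Nor is it necessarily fixed for a single table over the lifetime of the connector, since the user may execute an ALTER TABLE command. The connector must be able to detect these changes and react appropriately.

Sink connectors are usually simpler because they consume data and therefore do not need to create schemas. However, they should take just as much care to validate that the schemas they receive have the expected format. A schema mismatch usually indicates that the upstream producer is generating invalid data that cannot be correctly translated to the destination system. When the schema does not match, sink connectors should throw an exception to indicate this error to the system.

Kafka Connect Administration

Kafka Connect's REST layer provides a set of APIs for administering the cluster. This includes APIs to view the configuration of connectors and the status of their tasks, as well as to alter their current behavior (for example, changing configuration and restarting tasks).

When a connector is first submitted to the cluster, the workers rebalance the full set of connectors in the cluster and their tasks, so that each worker has approximately the same amount of work. The same rebalancing procedure is used when connectors increase or decrease the number of tasks they require, or when a connector's configuration is changed. You can use the REST API to view the current status of a connector and its tasks, including the id of the worker to which each was assigned. For example, querying the status of a file source (using GET /connectors/file-source/status) might produce output like the following: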

{
  "name": "file-source",
  "connector": {
    "state": "RUNNING",
    "worker_id": "192.168.1.208:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "192.168.1.209:8083"
    }
  ]
}

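Connectors and their tasks publish status updates to a shared topic (configured with status.storage.topic), which all workers in the cluster monitor. Because the workers consume this topic asynchronously, there is typically a (short) delay before a state change becomes visible through the status API. The following states are possible for a connector or one of its tasks:

UNASSIGNED: The connector/task has not yet been assigned to a worker.
RUNNING: The connector/task is running.
PAUSED: The connector/task has been administratively paused.
FAILED: The connector/task has failed (usually by raising an exception, which is reported in the status output).

In most cases, connector and task states will match, though they may differ for short periods of time while changes are occurring or if tasks have failed. For example, when a connector is first started, there may be a noticeable delay before the connector and its tasks have all transitioned to the RUNNING state. States will also diverge when tasks fail, since Connect does not automatically restart failed tasks. To restart a connector or task manually, you can use the restart APIs listed above.

Note that if you try to restart a task while a rebalance is taking place, Connect will return a 409 (Conflict) status code. You can retry after the rebalance completes, but it might not be necessary, since rebalances effectively restart all the connectors and tasks in the cluster.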
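A 409 only means that a rebalance is in progress, so a client may choose to retry the restart call after a short pause. The sketch below assumes the HTTP call is wrapped in an `IntSupplier` that returns the response status code. `maxAttempts` and `backoffMillis` are client-side assumptions, not Connect settings.

```java
import java.util.function.IntSupplier;

// Hypothetical client-side policy for the restart endpoints: retry while the
// worker answers 409 (Conflict, i.e. a rebalance is in progress).
final class RestartRetrySketch {
    static int restartWithRetry(IntSupplier restartCall, int maxAttempts, long backoffMillis) {
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = restartCall.getAsInt();
            if (status != 409) return status;       // success or a non-retryable error
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);     // give the rebalance time to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return status;                   // stop retrying if interrupted
                }
            }
        }
        return status;                               // still 409 after all attempts
    }
}
```

As noted above, retrying is often unnecessary, since the rebalance itself restarts every connector and task.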

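It is sometimes useful to temporarily stop the message processing of a connector. For example, if the remote system is undergoing maintenance, it is preferable for source connectors to stop polling it for new data instead of filling the logs with exception spam. For this use case, Connect offers a pause/resume API. While a source connector is paused, Connect stops polling it for additional records. While a sink connector is paused, Connect stops pushing new messages to it. The pause state is persistent, so even if you restart the cluster, the connector will not resume message processing until the task has been resumed. Note that there may be a delay before all of a connector's tasks have transitioned to the PAUSED state, since they may need to finish whatever processing they were in the middle of. Additionally, failed tasks will not transition to the PAUSED state until they have been restarted.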

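Kafka Streams Developer Guide

9. Kafka Streams

9.1 Overview

Kafka Streams is a client library for processing and analyzing data stored in Kafka. It either writes the resulting data back to Kafka or sends the final output to an external system. It builds on important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state.

Kafka Streams has a low barrier to entry. You can quickly write and run a small-scale proof of concept on a single machine, and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles load balancing across multiple instances of the same application by leveraging Kafka's parallelism model.

Kafka Streams highlights:

Designed as a simple and lightweight client library, which can be easily embedded in any Java application and integrated with any existing packaging of the application.
Has no external dependencies on systems other than Apache Kafka itself, which serves as the internal messaging layer; notably, it uses Kafka's partitioning model to scale processing horizontally while maintaining ordering guarantees.
Supports fault-tolerant local state, which enables very fast and efficient stateful operations such as joins and windowed aggregations.
Employs one-record-at-a-time processing to achieve low latency, and supports event-time based windowing operations.
Offers the necessary stream processing primitives, along with a high-level Streams DSL and a low-level Processor API.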

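9.2 Core Concepts

We first summarize the key concepts of Kafka Streams.

Stream Processing Topology

A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.
A stream processing application written with Kafka Streams defines its computational logic through one or more processor topologies, where a processor topology is a graph of stream processors (nodes) connected by streams (edges).
A stream processor is a node in the processor topology; it represents a processing step that transforms data in streams. It receives one input record at a time from its upstream processors in the topology, applies its operation to it, and may subsequently produce one or more output records to its downstream processors.

There are two special processors in the topology:

Source Processor: A source processor is a special type of stream processor that does not have any upstream processors. It produces an input stream for its topology from one or more Kafka topics by consuming records from these topics and forwarding them to its downstream processors.
Sink Processor: A sink processor is a special type of stream processor that does not have downstream processors. It sends any records received from its upstream processors to a specified Kafka topic.

[Figure: a processor topology with source, stream, and sink processors]

Kafka Streams offers two ways to define the stream processing topology: the Kafka Streams DSL provides the most common data transformation operations, such as map and filter; the lower-level Processor API allows developers to define and connect custom processors as well as to interact with state stores.

A processor topology is merely a logical abstraction for your stream processing code.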

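Time

A critical aspect of stream processing is the notion of time, and how it is modeled and integrated. For example, some operations, such as windowing, are defined based on time boundaries.

Common notions of time in streams are:

Event time - The point in time when an event or data record occurred, i.e. when it was originally created "at the source".
Processing time - The point in time when the event or data record happens to be processed by the stream processing application, i.e. when the record is being consumed. The processing time may be milliseconds, hours, or days later than the original event time.
Ingestion time - The point in time when an event or data record is stored in a topic partition by a Kafka broker. This differs from event time because the ingestion timestamp is generated when the Kafka broker appends the record to the target topic, not when the record is created "at the source". It differs from processing time because processing time is when the stream processing application processes the record. For example, if a record is never processed, it has no processing time, but it still has an ingestion time.

Kafka Streams assigns a timestamp to every data record via the TimestampExtractor interface. Concrete implementations of this interface may retrieve or compute timestamps based on the actual contents of data records, such as an embedded timestamp field, to provide event-time semantics. They may also use any other approach, such as returning the current wall-clock time at the time of processing, which yields processing-time semantics for the stream processing application. Developers can thus enforce different notions of time depending on their business needs. For example, per-record timestamps describe the progress of a stream with regard to time (although records may be out of order within the stream) and are leveraged by time-dependent operations such as joins.

Finally, whenever a Kafka Streams application writes records to Kafka, it also assigns timestamps to these new records. The way the timestamps are assigned depends on the context:

When new output records are generated by processing some input record (for example, context.forward() triggered in the process() function call), the output record timestamps are inherited directly from the input record timestamps.
When new output records are generated by periodic functions such as punctuate(), the output record timestamp is defined as the current internal time of the stream task (obtained through context.timestamp()).
For aggregations, the timestamp of a resulting aggregate update record is that of the latest arrived input record that triggered the update.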

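State

Some stream processing applications don't require state, which means the processing of a message is independent of the processing of all other messages. However, being able to maintain state opens up many possibilities for sophisticated stream processing applications: you can join input streams, or group and aggregate data records. Many such stateful operators are provided by the Kafka Streams DSL.

Kafka Streams provides so-called state stores, which stream processing applications can use to store and query data. This is an important capability when implementing stateful operations. Every task in Kafka Streams embeds one or more state stores that can be accessed via APIs to store and query the data required for processing. These state stores can be a persistent key-value store, an in-memory hashmap, or another convenient data structure. Kafka Streams offers fault tolerance and automatic recovery for local state stores.

As mentioned above, the computational logic of a Kafka Streams application is defined as a processor topology. Currently Kafka Streams provides two sets of APIs to define the processor topology, which are described in the following sections.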

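9.3 Architecture

Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries. It leverages the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity. In this section, we describe how Kafka Streams works under the covers.

The figure below shows the anatomy of an application that uses the Kafka Streams library. Let's walk through some details.

[Figure: anatomy of a Kafka Streams application]

Stream Partitions and Tasks

The messaging layer of Kafka partitions data for storing and transporting it; Kafka Streams partitions data for processing it. In both cases, this partitioning is what enables elasticity, scalability, high performance, and fault tolerance. Kafka Streams uses the concepts of partitions and tasks as logical units of its parallelism model, which is based on Kafka topic partitions. There are close links between Kafka Streams and Kafka in the context of parallelism:

Each stream partition is a totally ordered sequence of data records and maps to a Kafka topic partition.
A data record in the stream maps to a Kafka message from that topic.
The keys of data records determine the partitioning of data in both Kafka and Kafka Streams, i.e., how data is routed to specific partitions within topics.

An application's processor topology is scaled by breaking it into multiple tasks. More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application. Each task is assigned a list of partitions from the input streams (i.e., Kafka topics).

The assignment of partitions to tasks never changes, so each task is a fixed unit of parallelism of the application. Tasks can then instantiate their own processor topology based on the assigned partitions. They also maintain a buffer for each of their assigned partitions and process messages one at a time from these record buffers. As a result, stream tasks can be processed independently and in parallel without manual intervention.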
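Here is a simplified model of the partition-to-task mapping for one sub-topology: one task per partition number, with each task reading that partition of every input topic that has it. It is an illustration under that assumption, not Kafka Streams' internal grouping code. The topic names and the `0_p` task-id format are used only for readability.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model: for one sub-topology (topicGroupId), create one task per
// partition number; task "<group>_<p>" reads partition p of each input topic.
final class TaskCreationSketch {
    static Map<String, List<String>> tasksFor(int topicGroupId, Map<String, Integer> partitionsPerTopic) {
        int maxPartitions = 0;
        for (int n : partitionsPerTopic.values()) maxPartitions = Math.max(maxPartitions, n);

        Map<String, List<String>> tasks = new LinkedHashMap<>();
        for (int p = 0; p < maxPartitions; p++) {
            List<String> assigned = new ArrayList<>();
            for (Map.Entry<String, Integer> e : partitionsPerTopic.entrySet()) {
                if (p < e.getValue()) assigned.add(e.getKey() + "-" + p); // "topic-partition"
            }
            tasks.put(topicGroupId + "_" + p, assigned); // fixed for the life of the app
        }
        return tasks;
    }
}
```

Because the mapping depends only on partition counts, it stays the same no matter how many instances or threads are running. This is what makes each task a fixed unit of parallelism.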

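It is important to understand that Kafka Streams is not a resource manager, but a library that "runs" anywhere its stream processing application runs. Multiple instances of the application can run on the same machine or be spread across multiple machines, and the library automatically distributes tasks to those running instances. The assignment of partitions to tasks never changes. If an application instance fails, all its assigned tasks are automatically restarted on other instances and continue to consume from the same stream partitions.

The following diagram shows two tasks, each assigned one partition of the input streams.

[Figure: two stream tasks, each assigned one input stream partition]

Threading Model

Kafka Streams allows the user to configure the number of threads that the library can use to parallelize processing within an application instance. Each thread can execute one or more tasks with their processor topologies independently. For example, the following diagram shows one stream thread running two stream tasks.

[Figure: one stream thread running two stream tasks]

Starting more stream threads or more instances of the application merely amounts to replicating the topology (in other words, running the same code on more machines) and having each copy process a different subset of Kafka partitions, which parallelizes processing. It is worth noting that there is no shared state amongst the threads, so no inter-thread coordination is necessary. This makes it very simple to run topologies in parallel across application instances and threads. Kafka Streams transparently handles the assignment of Kafka topic partitions amongst the various stream threads, leveraging Kafka's coordination functionality.

As described above, scaling your stream processing application with Kafka Streams is easy. You merely need to start additional instances of your application, and Kafka Streams takes care of distributing partitions amongst the tasks that run in those instances. You can start as many threads of the application as there are input Kafka topic partitions. That way, across all running instances, every thread (or rather, the tasks it runs) has at least one input partition to process.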
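The following sketch shows why threads beyond the number of tasks stay idle. It places task ids on threads round-robin, which is a simplifying assumption: Kafka Streams' real assignor also balances across instances and takes previous assignments and standby replicas into account.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified round-robin placement of stream tasks onto stream threads
// (counted across all application instances). Illustration only.
final class ThreadAssignmentSketch {
    static List<List<String>> assign(List<String> taskIds, int totalThreads) {
        if (totalThreads <= 0) throw new IllegalArgumentException("need at least one thread");
        List<List<String>> perThread = new ArrayList<>();
        for (int t = 0; t < totalThreads; t++) perThread.add(new ArrayList<>());
        for (int i = 0; i < taskIds.size(); i++) {
            perThread.get(i % totalThreads).add(taskIds.get(i)); // thread i mod N gets task i
        }
        return perThread;
    }
}
```

With four tasks and six threads, two threads receive nothing. This is why the number of input partitions is the practical upper bound on useful threads.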

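Local State Stores

Kafka Streams provides so-called state stores, which stream processing applications can use to store and query data; this is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you call stateful operators such as join() or aggregate(), or when you window a stream.

Every stream task in a Kafka Streams application may embed one or more local state stores that can be accessed via APIs to store and query the data required for processing. Kafka Streams offers fault tolerance and automatic recovery for such local state stores.

The following diagram shows two stream tasks with their dedicated local state stores.

[Figure: two stream tasks with their dedicated local state stores]

Fault Tolerance

Kafka Streams builds on the fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated, so when stream data is persisted to Kafka it remains available even if the application fails and needs to reprocess it. Tasks in Kafka Streams leverage the fault-tolerance capability offered by the Kafka consumer client to handle failures. If a task fails, Kafka Streams automatically restarts it in one of the remaining running instances of the application.

In addition, Kafka Streams makes sure that the local state stores are robust to failures. For each state store, it maintains a changelog topic that tracks every state update. These changelog topics are partitioned as well, so that each local state store instance, and hence the task accessing the store, has its own dedicated changelog topic partition. Log compaction is enabled on the changelog topics so that old data can be purged safely, preventing the topics from growing indefinitely. If a task fails and is restarted on another machine, Kafka Streams first replays the corresponding changelog topic to restore the task's state stores to their content before the failure. Only then does it resume processing on the newly started task. As a result, failure handling is completely transparent to the end user.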
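The changelog mechanism can be modeled in a few lines of plain Java. Each update is appended to a log, compaction keeps only the latest entry per key, and a fresh store is rebuilt by replaying the log.

This is a conceptual sketch, not Kafka Streams' implementation. In particular, it drops tombstones (null values, meaning "delete") immediately during compaction, whereas Kafka retains them for a configurable time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual model of a changelog-backed key-value store.
final class ChangelogStoreSketch {
    static final class Entry {
        final String key; final Long value; // value == null is a tombstone (delete)
        Entry(String key, Long value) { this.key = key; this.value = value; }
    }

    final Map<String, Long> store = new LinkedHashMap<>();
    final List<Entry> changelog;

    ChangelogStoreSketch(List<Entry> changelog) { this.changelog = changelog; }

    // Every update changes the local store AND is appended to the changelog.
    void put(String key, Long value) {
        if (value == null) store.remove(key); else store.put(key, value);
        changelog.add(new Entry(key, value));
    }

    // Log compaction: keep only the last record per key; deleted keys vanish.
    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : log) { latest.remove(e.key); latest.put(e.key, e); }
        List<Entry> out = new ArrayList<>();
        for (Entry e : latest.values()) if (e.value != null) out.add(e);
        return out;
    }

    // Restore on another "machine": replay the changelog into an empty store.
    static ChangelogStoreSketch restore(List<Entry> log) {
        ChangelogStoreSketch fresh = new ChangelogStoreSketch(new ArrayList<>());
        for (Entry e : log) {
            if (e.value == null) fresh.store.remove(e.key); else fresh.store.put(e.key, e.value);
        }
        return fresh;
    }
}
```

Replaying the compacted log yields the same state as replaying the full log. This is why compaction can bound the changelog's size without affecting recovery.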

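Note that the cost of task (re)initialization typically depends primarily on the time needed to restore the state by replaying the state stores' changelog topics. To minimize this restoration time, users can configure their applications to keep standby replicas of local states (i.e. fully replicated copies of the state). When a task migration happens, Kafka Streams then attempts to assign the task to an application instance where such a standby replica already exists, in order to minimize the task (re)initialization cost. See num.standby.replicas in the Kafka Streams Configs section.

9.4 Developer Guide

A quickstart example shows how to run a stream processing program written with Kafka Streams. This section focuses on how to write, configure, and execute a Kafka Streams application.

Low-Level Processor API

Processor

Developers can define their customized processing logic by implementing the Processor interface, which provides the process and punctuate methods. The process method is performed on each received record, and the punctuate method is performed periodically based on elapsed time. In addition, the processor can keep the current ProcessorContext instance in a variable initialized in the init method. It can use the context to:

schedule the punctuation period (context().schedule)
forward modified or new key-value pairs to downstream processors (context().forward)
commit the current processing progress (context().commit)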

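import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class MyProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Integer> kvStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        // keep the context; punctuate() needs it to forward records and commit
        this.context = context;
        // call this processor's punctuate() method every 1000 milliseconds
        this.context.schedule(1000);
        // retrieve the key-value store named "Counts"
        this.kvStore = (KeyValueStore<String, Integer>) context.getStateStore("Counts");
    }

    @Override
    public void process(String dummy, String line) {
        String[] words = line.toLowerCase().split(" ");

        for (String word : words) {
            Integer oldValue = this.kvStore.get(word);

            if (oldValue == null) {
                this.kvStore.put(word, 1);
            } else {
                this.kvStore.put(word, oldValue + 1);
            }
        }
    }

    @Override
    public void punctuate(long timestamp) {
        KeyValueIterator<String, Integer> iter = this.kvStore.all();

        while (iter.hasNext()) {
            KeyValue<String, Integer> entry = iter.next();
            context.forward(entry.key, entry.value.toString());
        }

        iter.close();
        // commit the current processing progress
        context.commit();
    }

    @Override
    public void close() {
        // close the key-value store
        this.kvStore.close();
    }
}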

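The above implementation performs the following actions:

In the init method, it schedules the punctuation every 1 second and retrieves the local state store by its name, "Counts".
In the process method, for each received record, it splits the value string into words and updates their counts in the state store (we will talk about this feature later in this section).
In the punctuate method, it iterates over the local state store, sends the aggregated counts to the downstream processor, and commits the current stream state.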
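To see what MyProcessor computes without a running cluster, here is a plain-Java walk-through of the same logic. A `TreeMap` plays the "Counts" store, and a list collects what `context.forward()` would send downstream. It is an illustration only and does not use the Processor API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of MyProcessor's logic (not the Processor API).
final class WordCountSketch {
    private final Map<String, Integer> kvStore = new TreeMap<>(); // stands in for "Counts"
    final List<String> forwarded = new ArrayList<>();             // what forward() would emit

    // process(): split the line into words and bump each word's count.
    void process(String line) {
        for (String word : line.toLowerCase().split(" ")) {
            kvStore.merge(word, 1, Integer::sum);
        }
    }

    // punctuate(): emit every (word, count) pair currently in the store.
    void punctuate() {
        for (Map.Entry<String, Integer> e : kvStore.entrySet()) {
            forwarded.add(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Like the original, each punctuation emits the cumulative counts for all words seen so far, not just the words seen since the previous punctuation.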

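Processor Topology

With the customized processors defined in the Processor API, developers can use the TopologyBuilder to build a processor topology by connecting these processors together (typically in the application's main method):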

    TopologyBuilder builder = new TopologyBuilder();

    builder.addSource("SOURCE", "src-topic")

        .addProcessor("PROCESS1", MyProcessor1::new /* the ProcessorSupplier that can generate MyProcessor1 */, "SOURCE")
        .addProcessor("PROCESS2", MyProcessor2::new /* the ProcessorSupplier that can generate MyProcessor2 */, "PROCESS1")
        .addProcessor("PROCESS3", MyProcessor3::new /* the ProcessorSupplier that can generate MyProcessor3 */, "PROCESS1")

        .addSink("SINK1", "sink-topic1", "PROCESS1")
        .addSink("SINK2", "sink-topic2", "PROCESS2")
        .addSink("SINK3", "sink-topic3", "PROCESS3");

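The above code builds the topology in several steps:

First, a source node named "SOURCE" is added to the topology using the addSource method, with the Kafka topic "src-topic" feeding it.
Three processor nodes are then added using the addProcessor method; the first processor is a child of the "SOURCE" node, and the parent of the other two processors.
Finally, three sink nodes are added using the addSink method to complete the topology, each piping from a different parent processor node and writing to a separate topic.

Local State Store

Note that the Processor API is not limited to accessing the current records as they arrive in the process() method. It can also maintain processing state that keeps recently arrived records for use in stateful operations such as aggregations or windowed joins. To take advantage of this, developers can define a state store by implementing the StateStore interface (the Kafka Streams library also provides a few extended interfaces, such as KeyValueStore). In practice, developers usually do not need to write such a state store from scratch. Instead, they can use the Stores factory and specify whether the store should be persistent, log-backed, etc. The following example creates a persistent key-value store named "Counts" with key type String and value type Long.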

StateStoreSupplier countStore = Stores.create("Counts")
    .withKeys(Serdes.String())
    .withValues(Serdes.Long())
    .persistent()
    .build();

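To use these state stores, developers can call the TopologyBuilder.addStateStore method when building the processor topology. This creates the local state and associates it with the processor nodes that need to access it. Alternatively, they can connect a created state store with existing processor nodes through TopologyBuilder.connectProcessorAndStateStores.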

    TopologyBuilder builder = new TopologyBuilder();

    builder.addSource("SOURCE", "src-topic")

        .addProcessor("PROCESS1", MyProcessor1::new, "SOURCE")
        // create the in-memory state store "COUNTS" associated with processor "PROCESS1"
        .addStateStore(Stores.create("COUNTS").withStringKeys().withStringValues().inMemory().build(), "PROCESS1")
        .addProcessor("PROCESS2", MyProcessor2::new /* the ProcessorSupplier that can generate MyProcessor2 */, "PROCESS1")
        .addProcessor("PROCESS3", MyProcessor3::new /* the ProcessorSupplier that can generate MyProcessor3 */, "PROCESS1")

        // connect the state store "COUNTS" with processor "PROCESS2"
        .connectProcessorAndStateStores("PROCESS2", "COUNTS")

        .addSink("SINK1", "sink-topic1", "PROCESS1")
        .addSink("SINK2", "sink-topic2", "PROCESS2")
        .addSink("SINK3", "sink-topic3", "PROCESS3");

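In the next section we present another way to build the processor topology: the Kafka Streams DSL.

High-Level Streams DSL

To build a processor topology using the Streams DSL, developers can use the KStreamBuilder class, which extends TopologyBuilder. A simple example is included with the Kafka source code in the streams/examples package. The rest of this section walks through some code to demonstrate the key steps of creating a topology with the Streams DSL, but we recommend that developers read the full example source code for details.

Duality of Streams and Tables

Before we discuss concepts such as aggregations in Kafka Streams, we must first introduce tables, and most importantly the relationship between tables and streams: the so-called stream-table duality. Essentially, this duality means that a stream can be viewed as a table, and vice versa. Kafka's log compaction feature, for example, exploits this duality.

A simple form of a table is a collection of key-value pairs, also called a map or associative array. Such a table may look as follows: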

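[Figure: a table as a collection of key-value pairs]

The stream-table duality describes the close relationship between streams and tables.

Stream as Table: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can easily be turned into a "real" table by replaying the changelog from beginning to end to reconstruct the table. Similarly, in a more general analogy, aggregating data records in a stream returns a table. An example is computing the total number of pageviews per user from a stream of pageview events; in the resulting table, the key is the user and the value is that user's pageview count.
Table as Stream: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream's data records are key-value pairs). A table is thus a stream in disguise, and it can easily be turned into a "real" stream by iterating over each key-value entry in the table.

Let's illustrate this with an example. Imagine a table that tracks the total number of pageviews per user (first column of the diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time, and the different revisions of the table, can be represented as a changelog stream (second column).

[Figure: table states over time (first column) and the corresponding changelog stream (second column)]

Interestingly, because of the stream-table duality, the same stream can be used to reconstruct the original table (third column):

[Figure: reconstructing the table from the changelog stream (third column)]

The same mechanism is used, for example, to replicate databases via change data capture (CDC). Within Kafka Streams, it is used to replicate the so-called state stores across machines for fault tolerance.

The stream-table duality is such an important concept that Kafka Streams models it explicitly via the KStream, KTable, and GlobalKTable interfaces, described in the following sections.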

KStream, KTable, GlobalKTable

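The DSL has three main abstractions:

KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded data set.
KTable is an abstraction of a changelog stream, where each data record represents an update. More precisely, the value in a data record is considered an update of the last value for the same record key, if any. If the key does not exist yet, the update is considered a create.
GlobalKTable, like KTable, is an abstraction of a changelog stream where each data record represents an update. Unlike a KTable, however, it is fully replicated on every KafkaStreams instance. GlobalKTable also provides the ability to look up the current value of a data record by key; this table-lookup functionality is available through join operations.

To illustrate the difference between KStreams and KTables/GlobalKTables, imagine that the following two data records are sent to the stream:

("alice", 1) --> ("alice", 3)

If the stream processing application sums the values per user, it returns 4 for alice if the input is a KStream. If the input is a KTable or GlobalKTable, it returns 3, because the last record is considered an update.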
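The alice example can be checked directly in plain Java by summing the same two records under two semantics: record-stream (KStream-like) and upsert (KTable-like). No Kafka Streams classes are used, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Same records, two interpretations (illustration only).
final class StreamVsTableSketch {
    // KStream view: every record is an independent fact, so all values add up.
    static Map<String, Integer> sumAsStream(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) sums.merge(r.getKey(), r.getValue(), Integer::sum);
        return sums;
    }

    // KTable view: a later record for the same key replaces the earlier one,
    // so each key contributes only its current value.
    static Map<String, Integer> sumAsTable(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) latest.put(r.getKey(), r.getValue());
        return latest;
    }
}
```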

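Create Source Streams

A record stream (KStream) or a changelog stream (KTable or GlobalKTable) can be created as a source stream from one or more Kafka topics. A KStream can read from multiple topics, but a KTable or GlobalKTable can only be created from a single topic.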

KStreamBuilder builder = new KStreamBuilder();

KStream<String, GenericRecord> source1 = builder.stream("topic1", "topic2");
KTable<String, GenericRecord> source2 = builder.table("topic3", "stateStoreName");
GlobalKTable<String, GenericRecord> source3 = builder.globalTable("topic4", "globalStoreName");

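Windowing a stream

A stream processor may need to divide data records into time buckets, i.e. to window the stream by time. This is usually needed for join and aggregation operations. Kafka Streams currently defines the following types of windows:

Hopping time windows are windows based on time intervals. They model fixed-size, (possibly) overlapping windows. A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"). The advance interval specifies how far a window moves forward relative to the previous one. For example, you can configure a hopping window with a size of 5 minutes and an advance interval of 1 minute. Since hopping windows can overlap, a data record may belong to more than one such window.
Tumbling time windows are a special case of hopping time windows and, like the latter, are based on time intervals. They model fixed-size, non-overlapping, gap-less windows. A tumbling window is defined by a single property: the window's size. A tumbling window is a hopping window whose size equals its advance interval. Since tumbling windows never overlap, a data record belongs to one and only one window.
Sliding windows model a fixed-size window that slides continuously over the time axis. Two data records are included in the same window if the difference between their timestamps is within the window size. Thus, sliding windows are not aligned to the epoch, but to the data record timestamps. In Kafka Streams, sliding windows are used only for join operations and can be specified through the JoinWindows class.
Session windows aggregate key-based events into sessions. A session represents a period of activity separated by a defined gap of inactivity. Any event processed within the inactivity gap of an existing session is merged into that session. If an event falls outside the session gap, a new session is created. Session windows are tracked independently per key: windows of different keys typically have different start and end times. Their sizes also vary, so even windows for the same key typically have different sizes. As a result, session windows cannot be pre-computed; instead they are derived by analyzing the timestamps of the data records.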
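Window membership can be computed with plain arithmetic. The sketch makes two assumptions:

Hopping windows are epoch-aligned, with an inclusive start and an exclusive end; tumbling windows are the case `advance == size`.
For sessions, sorted timestamps whose distance to the current session's end is at most the inactivity gap are merged into that session.

It illustrates the definitions above and is not Kafka Streams' window store code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Window arithmetic for the definitions above (illustration only).
// Hopping/tumbling windows are {start, end} with end exclusive;
// sessions are {firstTimestamp, lastTimestamp}.
final class WindowSketch {

    // Epoch-aligned hopping windows: starts are multiples of `advance`.
    // With advance == size this yields exactly one (tumbling) window.
    static List<long[]> hoppingWindowsFor(long timestamp, long size, long advance) {
        List<long[]> windows = new ArrayList<>();
        long start = Math.max(0, timestamp - size + advance) / advance * advance;
        for (; start <= timestamp; start += advance) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    // Session windows for ONE key: a timestamp within `inactivityGap` of the
    // current session's last event extends it; otherwise a new session starts.
    static List<long[]> sessions(List<Long> timestamps, long inactivityGap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] current = result.isEmpty() ? null : result.get(result.size() - 1);
            if (current != null && ts - current[1] <= inactivityGap) {
                current[1] = ts;                  // still active: extend the session
            } else {
                result.add(new long[] {ts, ts});  // gap exceeded: open a new session
            }
        }
        return result;
    }
}
```

For a 5-minute window advancing by 1 minute, a record at minute 7 falls into the five windows starting at minutes 3 through 7. With a 5-minute tumbling window, it falls only into [5, 10).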

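In the Kafka Streams DSL, developers can specify a retention period for a window. This allows Kafka Streams to keep old window segments for a period of time in order to wait for late-arriving records whose timestamps fall within the window interval. If a record arrives after the retention period has passed, it cannot be processed and is dropped.

Late-arriving records are always possible in real-time data streams. How they are handled depends on the effective time semantics. With processing time, the semantics are "when the data is being processed", so the notion of late records does not apply: by definition, no record can be late. Hence, records can really only be considered late under event-time or ingestion-time semantics. In both cases, Kafka Streams is able to handle late-arriving records properly.

Join multiple streams

A join operation merges two streams based on the keys of their data records and yields a new stream. A join over record streams usually needs to be performed on a windowed basis; otherwise the number of records that must be maintained to perform the join may grow indefinitely. In Kafka Streams, you can perform the following join operations:

KStream-to-KStream joins are always windowed joins, since otherwise the memory and state required to compute the join would grow without bound. A newly received record from one of the streams is joined with the other stream's records within the specified window interval, producing one result for each matching pair based on the user-provided ValueJoiner. This operator returns a new KStream instance representing the result stream of the join.
KTable-to-KTable joins are designed to be consistent with joins in relational databases. Both changelog streams are first materialized into local state stores. When a new record is received from one of the streams, it is joined with the other stream's materialized state store, producing one result for each matching pair based on the user-provided ValueJoiner. This operator returns a new KTable instance representing the result of the join, which is also a changelog stream of the represented table.
KStream-to-KTable joins allow you to perform table lookups against a changelog stream (KTable) upon receiving a new record from another record stream (KStream). An example use case is enriching a stream of user activities (KStream) with the latest user profile information (KTable). Only records received from the record stream trigger the join and produce results via ValueJoiner, not vice versa; records received from the changelog stream are used only to update the materialized state store. This operator returns a new KStream instance representing the result stream of the join.
KStream-to-GlobalKTable joins allow you to perform table lookups against a fully replicated changelog stream (GlobalKTable) upon receiving a new record from another record stream (KStream). Joins with a GlobalKTable do not require repartitioning of the input KStream, because all partitions of the GlobalKTable are available on every KafkaStreams instance. The KeyValueMapper provided with the join operation is applied to each KStream record to extract the join key used for the GlobalKTable lookup, so joins on something other than the record key are possible. An example use case is enriching a stream of user activities (KStream) with the latest user profile information (GlobalKTable). Only records received from the record stream trigger the join and produce results via ValueJoiner, not vice versa; records received from the changelog stream are used only to update the materialized state store. This operator returns a new KStream instance representing the result stream of the join.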
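The KStream-to-KTable behaviour described above is small enough to model in plain Java: table updates only change state, while stream records trigger lookups. The sketch shows inner-join semantics, with a `BiFunction` standing in for ValueJoiner. It is not the Kafka Streams implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java model of KStream-to-KTable (inner) join semantics.
final class StreamTableJoinSketch<V1, V2, R> {
    private final Map<String, V2> tableState = new HashMap<>(); // materialized KTable side
    private final BiFunction<V1, V2, R> joiner;                  // stands in for ValueJoiner
    final List<R> output = new ArrayList<>();                    // the resulting KStream

    StreamTableJoinSketch(BiFunction<V1, V2, R> joiner) { this.joiner = joiner; }

    // Changelog (KTable) side: update state only, never emit.
    void onTableRecord(String key, V2 value) {
        if (value == null) tableState.remove(key); else tableState.put(key, value);
    }

    // Record (KStream) side: look up the current table value; emit on a match.
    void onStreamRecord(String key, V1 value) {
        V2 match = tableState.get(key);
        if (match != null) output.add(joiner.apply(value, match));
    }
}
```

A stream record that arrives before the table has a value for its key produces no output. A later table update does not retroactively join with it.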

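Depending on the operands, the following join operations are supported: inner joins, outer joins, and left joins. Their semantics are similar to the corresponding operators in relational databases.

Aggregate a stream

An aggregation operation takes one input stream and yields a new stream by combining multiple input records into a single output record. Examples of aggregations are computing counts or sums. An aggregation over record streams usually needs to be performed on a windowed basis; otherwise the number of records that must be maintained to perform the aggregation may grow indefinitely.

In the Kafka Streams DSL, the input stream of an aggregation can be a KStream or a KTable, but the output stream is always a KTable. This allows Kafka Streams to update an aggregate value when further records arrive late, after the value was already produced and emitted. When such a late arrival happens, the aggregating KStream or KTable simply emits a new aggregate value. Because the output is a KTable, the new value overwrites the old value with the same key in subsequent processing steps.

Transform a stream

Besides join and aggregation operations, KStream and KTable each provide a list of other transformation operations. Each of these operations may generate one or more KStream and KTable objects, and each can be translated into one or more connected processors in the underlying processor topology. All these transformation methods can be chained together to compose a complex processor topology. Since KStream and KTable are strongly typed, all these transformation operations are defined as generic functions in which users can specify the input and output data types.

Among these transformations, filter, map, mapValues, etc. are stateless operations and can be applied to both KStream and KTable. Users usually pass a customized function to them as a parameter, such as a Predicate for filter or a KeyValueMapper for map: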

// written in Java 8+, using lambda expressions
KStream<String, Object> mapped = source1.mapValues(record -> record.get("category"));

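Stateless transformations, by definition, do not depend on any state for processing, so implementation-wise they do not require a state store associated with the stream processor. Stateful transformations, on the other hand, require access to an associated state to process records and produce outputs. For example, join and aggregate operations usually use a windowing state to store all records received so far within the defined window boundary. The operators can then access these accumulated records in the store and compute based on them.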

// written in Java 8+, using lambda expressions
KTable<Windowed<String>, Long> counts = source1.groupByKey().aggregate(
    () -> 0L,  // initial value
    (aggKey, value, aggregate) -> aggregate + 1L,   // aggregating value
    TimeWindows.of(5000L).advanceBy(1000L), // intervals in milliseconds
    Serdes.Long(), // serde for aggregated value
    "counts" // name of the backing state store
);

KStream<String, String> joined = source1.leftJoin(source2,
    // in a left join, record2 is null when the table has no value for the key
    (record1, record2) -> record1.get("user") + "-" + (record2 == null ? null : record2.get("region"))
);

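Write streams back to Kafka

At the end of the processing, developers can (continuously) write the final result streams back to a Kafka topic through KStream.to and KTable.to.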

 joined.to("topic4");

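If your application needs to continue reading and processing the records after they have been written to a topic via to() above, one option is to construct a new stream that reads from the output topic. Kafka Streams provides a convenience method for this, through():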

    // equivalent to
    //
    // joined.to("topic4");
    // materialized = builder.stream("topic4");
    KStream<String, String> materialized = joined.through("topic4");

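Application Configuration and Execution

Besides defining the topology, developers also need to configure their applications in StreamsConfig before running them. A complete list of Kafka Streams configs is available in the Kafka Streams Configs section.

Specifying configuration in Kafka Streams is similar to the Kafka producer and consumer clients. Typically, you create a java.util.Properties instance, set the necessary parameters, and construct a StreamsConfig instance from the Properties instance.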

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties settings = new Properties();
// Set a few key parameters
settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-first-streams-application");
settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092");
// zookeeper.connect is deprecated as of 0.10.2 (see the upgrade notes below); older versions required it:
// settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "zookeeper1:2181");
// Any further settings
// settings.put(..., ...);

// Create an instance of StreamsConfig from the Properties instance
StreamsConfig config = new StreamsConfig(settings);

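Apart from Kafka Streams' own configuration parameters, you can also specify parameters for the Kafka consumers and producers used internally, depending on the needs of your application. As with the Streams settings, you define any such consumer and/or producer settings via StreamsConfig.

Note that some consumer and producer configuration parameters use the same name. Examples are send.buffer.bytes and receive.buffer.bytes, which configure TCP buffers, and request.timeout.ms and retry.backoff.ms, which control client request retries. If you need different values for the consumer and the producer, prefix the parameter name with consumer. or producer.: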

import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties settings = new Properties();
// Example of a "normal" setting for Kafka Streams
settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker-01:9092");

// Customize the Kafka consumer settings
settings.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);

// Customize a common client setting for both consumer and producer
settings.put(CommonClientConfigs.RETRY_BACKOFF_MS_CONFIG, 100L);

// Customize different values for consumer and producer
settings.put("consumer." + ConsumerConfig.RECEIVE_BUFFER_CONFIG, 1024 * 1024);
settings.put("producer." + ProducerConfig.RECEIVE_BUFFER_CONFIG, 64 * 1024);
// Alternatively, you can use
settings.put(StreamsConfig.consumerPrefix(ConsumerConfig.RECEIVE_BUFFER_CONFIG), 1024 * 1024);
settings.put(StreamsConfig.producerPrefix(ProducerConfig.RECEIVE_BUFFER_CONFIG), 64 * 1024);

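You can call Kafka Streams from anywhere in your application code; most commonly, this is done in the application's main() method.

First, create an instance of KafkaStreams. The first argument of the KafkaStreams constructor is the topology builder that defines the topology: either KStreamBuilder for the Kafka Streams DSL, or TopologyBuilder for the Processor API. The second argument is an instance of StreamsConfig, as described above.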

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.processor.TopologyBuilder;

// Use the builders to define the actual processing topology, e.g. to specify
// from which input topics to read, which stream operations (filter, map, etc.)
// should be called, and so on.

KStreamBuilder builder = ...;  // when using the Kafka Streams DSL
//
// OR
//
TopologyBuilder builder = ...; // when using the Processor API

// Use the configuration to tell your application where the Kafka cluster is,
// which serializers/deserializers to use by default, to specify security settings,
// and so on.
StreamsConfig config = ...;

KafkaStreams streams = new KafkaStreams(builder, config);

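At this point, the internal structures have been initialized, but processing has not started yet. You have to explicitly start the Kafka Streams threads by calling the start() method: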

// Start the Kafka Streams instance
streams.start();

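To catch any unexpected exceptions, you can set a java.lang.Thread.UncaughtExceptionHandler. This handler is called whenever a stream thread is terminated by an unexpected exception: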

streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // here you should examine the exception and perform an appropriate action!
    }
});

To stop the application instance, call the close() method:

// Stop the Kafka Streams instance
streams.close();

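Now run your application just like any other Java application; Kafka Streams has no special requirements. For example, you can package your application as a fat jar and start it like this: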

# Start the application in class `com.example.MyStreamsApp`
# from the fat jar named `path-to-app-fatjar.jar`.
$ java -cp path-to-app-fatjar.jar com.example.MyStreamsApp

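When the application instance starts running, the defined processor topology is initialized as one or more stream tasks, which can be executed in parallel by the stream threads within the instance. If the processor topology defines any state stores, they are (re)built while their stream tasks are initialized.

It is important to understand that starting your application as described above actually launches what Kafka Streams considers one instance of your application. In real-world scenarios it is more common to have multiple instances of your application running in parallel (for example, in other JVMs or on other machines). In that case, Kafka Streams transparently reassigns tasks from the existing instances to the newly started one. See Stream Partitions and Tasks and Threading Model for details.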

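9.5 Upgrade Guide and API Changes

If you want to upgrade from 0.10.1.x to 0.10.2, see the upgrade section for 0.10.2. It highlights the incompatible changes you need to consider when upgrading your application. Below is the complete list of 0.10.2 API changes. They let you advance your application and/or simplify your code, including through the use of new features.

Streams API changes in 0.10.2.0

New methods in KafkaStreams:

set a listener to react to application state changes via #setStateListener(StateListener listener)
retrieve the current application state via #state()
retrieve the global metrics registry via #metrics()
apply a timeout when closing an application via #close(long timeout, TimeUnit timeUnit)
specify a custom indent when retrieving Kafka Streams information via #toString(String indent)

Parameter updates in StreamsConfig:

zookeeper.connect was deprecated; a Kafka Streams application no longer interacts with ZooKeeper for topic management, but uses the new broker admin protocol (see KIP-4, section "Topic Admin Schema")
added many new parameters for metrics, security, and client configurations

Changes in the StreamsMetrics interface:

removed method: #addLatencySensor()
added methods: #addLatencyAndThroughputSensor(), #addThroughputSensor(), #recordThroughput(), #addSensor(), #removeSensor()

New methods in TopologyBuilder:

added overloads for #addSource() that allow defining an auto.offset.reset policy per source node
added method #addGlobalStore() to add global StateStores

New methods in KStreamBuilder:

added overloads for #stream() and #table() that allow defining an auto.offset.reset policy per input stream/table
added method #globalTable() to create a GlobalKTable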

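New joins for KStream:

added overloads for #join() to join with a KTable
added overloads for #join() and #leftJoin() to join with a GlobalKTable
note that join semantics were improved in 0.10.2, so you may see different results compared to 0.10.0.x and 0.10.1.x (see Kafka Streams Join Semantics in the Apache Kafka wiki)

Aligned null-key handling for KTable joins:

like all other KTable operations, KTable-KTable joins no longer throw an exception on null-key records, but drop those records silently

New window type: session windows

added class SessionWindows to specify session windows
added overloads for the KGroupedStream methods #count(), #reduce(), and #aggregate() to allow session window aggregations

Changes to TimestampExtractor:

method #extract() now has a second parameter
new default timestamp extractor class FailOnInvalidTimestamp (it provides the same behavior as the old, now removed, default extractor ConsumerRecordTimestampExtractor)
new alternative timestamp extractor classes LogAndSkipOnInvalidTimestamp and UsePreviousTimeOnInvalidTimestamp

Relaxed type constraints of many DSL interfaces, classes, and methods (see KIP-100).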

Streams API更改为0.10.1.0

流分组和聚合分为2个方法：

老的: KStream #aggregateByKey(), #reduceByKey(), #countByKey()
新的: KStream#groupByKey() plus KGroupedStream #aggregate(), #reduce(), 和 #count()
例子: stream.countByKey() 更改为 stream.groupByKey().count()

自动重新分配:

在变更密钥操作之后和aggregation/join之前，不需要调用through()
例子：stream.selectKey(...).through(...).countByKey() 变更为 stream.selectKey().groupByKey().count()

TopologyBuilder:

#sourceTopics(String applicationId) and #topicGroups(String applicationId)方法简化为 #sourceTopics()和 #topicGroups()

DSL: new parameter to specify state store names:

the new Interactive Queries feature requires specifying a store name for all source KTables and window aggregation result KTables (the previous parameter "operator/window name" is now the storeName)
KStreamBuilder#table(String topic) changes to #table(String topic, String storeName)
KTable#through(String topic) changes to #through(String topic, String storeName)
KGroupedStream #aggregate(), #reduce(), and #count() require an additional parameter "String storeName"
Example: stream.countByKey(TimeWindows.of("windowName", 1000)) changes to stream.groupByKey().count(TimeWindows.of(1000), "countStoreName")

Windowing:

windows are no longer named: TimeWindows.of("name", 1000) changes to TimeWindows.of(1000) (cf. DSL: new parameter to specify state store names)
JoinWindows has no default size anymore: JoinWindows.of("name").within(1000) changes to JoinWindows.of(1000)
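Without a name, TimeWindows.of(1000) is a tumbling window of 1000 ms. JoinWindows.of(1000) lets two records join when their timestamps are at most 1000 ms apart. The plain-Java sketch below shows how a timestamp maps to the windows that contain it, assuming windows aligned to the epoch and half-open [start, start + size). It also shows the join-window check, assuming inclusive bounds. The class WindowSketch and its methods are hypothetical, not Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of time-window assignment and join-window matching. */
public class WindowSketch {

    /**
     * Start times of all epoch-aligned windows [start, start + sizeMs) that advance by
     * advanceMs and contain the timestamp. advanceMs == sizeMs gives tumbling windows.
     */
    static List<Long> windowStartsFor(long timestamp, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long start = (Math.max(0, timestamp - sizeMs + advanceMs) / advanceMs) * advanceMs;
        for (; start <= timestamp; start += advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    /** Two records join if their timestamps differ by at most windowMs (inclusive). */
    static boolean joins(long leftTs, long rightTs, long windowMs) {
        return Math.abs(leftTs - rightTs) <= windowMs;
    }

    public static void main(String[] args) {
        System.out.println(windowStartsFor(2_500, 1_000, 1_000)); // tumbling: [2000]
        System.out.println(windowStartsFor(2_500, 1_000, 500));   // hopping:  [2000, 2500]
        System.out.println(joins(1_000, 1_999, 1_000));           // true
    }
}
```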

Building Kafka from source

Apache Kafka

First, you need to install Gradle and Java.

Kafka requires Gradle 2.0 or higher.

Java 7 should be used for building in order to support both Java 7 and Java 8 at runtime.

Source: https://github.com/apache/kafka

First bootstrap and download the wrapper

git clone git@github.com:apache/kafka.git

cd kafka_source_dir
gradle

Build a jar and run it

./gradlew jar

Build a source jar

./gradlew srcJar

Build aggregated javadoc

./gradlew aggregatedJavadoc

Build javadoc and scaladoc

./gradlew javadoc
./gradlew javadocJar # builds a javadoc jar for each module
./gradlew scaladoc
./gradlew scaladocJar # builds a scaladoc jar for each module
./gradlew docsJar # builds both (if applicable) javadoc and scaladoc jars for each module

Run unit/integration tests

./gradlew test # runs both unit and integration tests
./gradlew unitTest
./gradlew integrationTest

Force re-running tests without a code change

./gradlew cleanTest test
./gradlew cleanTest unitTest
./gradlew cleanTest integrationTest

Run a particular unit/integration test

./gradlew -Dtest.single=RequestResponseSerializationTest core:test

Run a particular test method within a unit/integration test

./gradlew core:test --tests kafka.api.ProducerFailureHandlingTest.testCannotSendToInternalTopic
./gradlew clients:test --tests org.apache.kafka.clients.MetadataTest.testMetadataUpdateWaitTime

Run a particular unit/integration test with log4j output

Change the log4j setting in either clients/src/test/resources/log4j.properties or core/src/test/resources/log4j.properties, then run:

./gradlew -i -Dtest.single=RequestResponseSerializationTest core:test

Generate test coverage reports

Generate coverage reports for the whole project:

./gradlew reportCoverage

Generate coverage for a single module, e.g.:

./gradlew clients:reportCoverage

Build a binary release gzipped tar ball

./gradlew clean
./gradlew releaseTarGz

The above command fails if you haven't set up a signing key. To bypass signing the artifacts, run:

./gradlew releaseTarGz -x signArchives

The release file can be found in ./core/build/distributions/.

Clean the build

./gradlew clean

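Run a task with a particular version of Scala (either 2.11.x or 2.12.x)

Note that if you build the jars with a Scala version other than 2.11.11, you need to set the SCALA_VERSION variable or change it in bin/kafka-run-class.sh to run the quick start.

You can pass either the major version (e.g. 2.11) or the full version (e.g. 2.11.11):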

./gradlew -PscalaVersion=2.11 jar
./gradlew -PscalaVersion=2.11 test
./gradlew -PscalaVersion=2.11 releaseTarGz

Scala 2.12.x requires Java 8.

Run a task for a specific project

This works for core, examples, and clients:

./gradlew core:jar
./gradlew core:test

List all Gradle tasks

./gradlew tasks

Build the IDE project

Note that this is not strictly necessary (IntelliJ IDEA, for example, has good built-in support for Gradle projects).

./gradlew eclipse
./gradlew idea

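The eclipse task is configured to use ${project_dir}/build_eclipse as Eclipse's build directory. Eclipse's default build directory (${project_dir}/bin) clashes with Kafka's scripts directory, and we don't use Gradle's build directory to avoid known issues with this configuration.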

Build the jars for all Scala versions and all projects

./gradlew jarAll

Run unit/integration tests for all Scala versions and all projects

./gradlew testAll

Build a binary release gzipped tar ball for all Scala versions

./gradlew releaseTarGzAll

Publish the jars for all Scala versions and all projects to Maven

./gradlew uploadArchivesAll

For this to work, create or update ${GRADLE_USER_HOME}/gradle.properties (typically ~/.gradle/gradle.properties) and assign the following variables:

mavenUrl=
mavenUsername=
mavenPassword=
signing.keyId=
signing.password=
signing.secretKeyRingFile=

Publish the Streams quickstart archetype artifact to Maven

The Streams archetype project cannot be uploaded to Maven with Gradle; instead, run the mvn deploy command in the quickstart folder:
cd streams/quickstart
mvn deploy

For this to work, create or update your user Maven settings (typically ${USER_HOME}/.m2/settings.xml) and assign the following variables:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                       https://maven.apache.org/xsd/settings-1.0.0.xsd">
...                           
<servers>
   ...
   <server>
      <id>apache.snapshots.https</id>
      <username>${maven_username}</username>
      <password>${maven_password}</password>
   </server>
   <server>
      <id>apache.releases.https</id>
      <username>${maven_username}</username>
      <password>${maven_password}</password>
    </server>
    ...
 </servers>
 ...

Install the jars to the local Maven repository

./gradlew installAll

Build the test jar

./gradlew testJar

Determine how transitive dependencies are added

./gradlew core:dependencies --configuration runtime

Determine whether any dependencies can be updated

./gradlew dependencyUpdates

Run code quality checks

We regularly run two code quality analysis tools, findbugs and checkstyle.

Checkstyle

Checkstyle enforces a consistent coding style in Kafka. You can run checkstyle using:

./gradlew checkstyleMain checkstyleTest

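The checkstyle warnings are written to the files reports/checkstyle/reports/main.html and reports/checkstyle/reports/test.html in the subproject build directories. They are also printed to the console. The build fails if Checkstyle fails.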

Findbugs

Findbugs uses static analysis to look for bugs in the code. You can run findbugs using:

./gradlew findbugsMain findbugsTest -x test

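The findbugs warnings are written to the files reports/findbugs/main.html and reports/findbugs/test.html in the subproject build directories. Use -PxmlFindBugsReport=true to generate an XML report instead of an HTML one.

Common build options

The following options should be set with a -P switch, for example ./gradlew -PmaxParallelForks=1 test.

commitId: sets the build commit ID, since .git/HEAD might not be correct if local commits were added for build purposes.
mavenUrl: sets the URL of the Maven deployment repository (file://path/to/repo can be used to point to a local repository).
maxParallelForks: limits the maximum number of processes for each task.
showStandardStreams: shows the standard output and standard error of the test JVM(s) on the console.
skipSigning: disables signing of artifacts.
testLoggingEvents: unit test events to be logged, separated by commas. For example ./gradlew -PtestLoggingEvents=started,passed,skipped,failed test.
xmlFindBugsReport: enables XML reports for findBugs. This also disables HTML reports, because only one can be enabled at a time.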

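Running and debugging the Kafka source code in IntelliJ IDEA

This section imports the Kafka source into IDEA, runs it, and debugs it with breakpoints.

First, install Scala and the Gradle build tool.

I use IntelliJ IDEA for debugging. Once the build from the previous chapter succeeds, we can start debugging.

First, start Kafka from within IDEA.

Starting the server

Find kafka-server-start.sh in bin/, right-click it, and run it.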

screenshot

It fails with:

/bin/bash /Users/weiwei/workspace/project/open/kafka/bin/kafka-server-start.sh
USAGE: /Users/weiwei/workspace/project/open/kafka/bin/kafka-server-start.sh [-daemon] server.properties [--override property=value]*

Process finished with exit code 1

The configuration file is clearly missing. Add it as an argument; mine is here:

screenshot

/bin/bash /Users/xxx/workspace/project/open/kafka/bin/kafka-server-start.sh /Users/weiwei/workspace/project/open/kafka/config/server.properties
SLF4J: See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[2017-09-05 16:34:29,311] INFO KafkaConfig values: 
    advertised.host.name = null
    advertised.listeners = null
    advertised.port = null
    alter.config.policy.class.name = null
    authorizer.class.name = 
    auto.create.topics.enable = true
    auto.leader.rebalance.enable = true
    background.threads = 10
    broker.id = 0
    broker.id.generation.enable = true
    broker.rack = null
    compression.type = producer
    connections.max.idle.ms = 600000
    controlled.shutdown.enable = true
    controlled.shutdown.max.retries = 3
    controlled.shutdown.retry.backoff.ms = 5000
    controller.socket.timeout.ms = 30000
    create.topic.policy.class.name = null
    default.replication.factor = 1
    delete.records.purgatory.purge.interval.requests = 1
    delete.topic.enable = true
    fetch.purgatory.purge.interval.requests = 1000
    group.initial.rebalance.delay.ms = 0
    group.max.session.timeout.ms = 300000
    group.min.session.timeout.ms = 6000
    host.name = 
    inter.broker.listener.name = null
    inter.broker.protocol.version = 1.0-IV0
    leader.imbalance.check.interval.seconds = 300
    leader.imbalance.per.broker.percentage = 10
    listener.security.protocol.map = SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,TRACE:TRACE,SASL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT
    listeners = null
    log.cleaner.backoff.ms = 15000
    log.cleaner.dedupe.buffer.size = 134217728
    log.cleaner.delete.retention.ms = 86400000
    log.cleaner.enable = true
    log.cleaner.io.buffer.load.factor = 0.9
    log.cleaner.io.buffer.size = 524288
    log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
    log.cleaner.min.cleanable.ratio = 0.5
    log.cleaner.min.compaction.lag.ms = 0
    log.cleaner.threads = 1
    log.cleanup.policy = [delete]
    log.dir = /tmp/kafka-logs
    log.dirs = /tmp/kafka-logs
    log.flush.interval.messages = 9223372036854775807
    log.flush.interval.ms = null
    log.flush.offset.checkpoint.interval.ms = 60000
    log.flush.scheduler.interval.ms = 9223372036854775807
    log.flush.start.offset.checkpoint.interval.ms = 60000
    log.index.interval.bytes = 4096
    log.index.size.max.bytes = 10485760
    log.message.format.version = 1.0-IV0
    log.message.timestamp.difference.max.ms = 9223372036854775807
    log.message.timestamp.type = CreateTime
    log.preallocate = false
    log.retention.bytes = -1
    log.retention.check.interval.ms = 300000
    log.retention.hours = 168
    log.retention.minutes = null
    log.retention.ms = null
    log.roll.hours = 168
    log.roll.jitter.hours = 0
    log.roll.jitter.ms = null
    log.roll.ms = null
    log.segment.bytes = 1073741824
    log.segment.delete.delay.ms = 60000
    max.connections.per.ip = 2147483647
    max.connections.per.ip.overrides = 
    message.max.bytes = 1000012
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    min.insync.replicas = 1
    num.io.threads = 8
    num.network.threads = 3
    num.partitions = 1
    num.recovery.threads.per.data.dir = 1
    num.replica.fetchers = 1
    offset.metadata.max.bytes = 4096
    offsets.commit.required.acks = -1
    offsets.commit.timeout.ms = 5000
    offsets.load.buffer.size = 5242880
    offsets.retention.check.interval.ms = 600000
    offsets.retention.minutes = 1440
    offsets.topic.compression.codec = 0
    offsets.topic.num.partitions = 50
    offsets.topic.replication.factor = 1
    offsets.topic.segment.bytes = 104857600
    port = 9092
    principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
    producer.purgatory.purge.interval.requests = 1000
    queued.max.request.bytes = -1
    queued.max.requests = 500
    quota.consumer.default = 9223372036854775807
    quota.producer.default = 9223372036854775807
    quota.window.num = 11
    quota.window.size.seconds = 1
    replica.fetch.backoff.ms = 1000
    replica.fetch.max.bytes = 1048576
    replica.fetch.min.bytes = 1
    replica.fetch.response.max.bytes = 10485760
    replica.fetch.wait.max.ms = 500
    replica.high.watermark.checkpoint.interval.ms = 5000
    replica.lag.time.max.ms = 10000
    replica.socket.receive.buffer.bytes = 65536
    replica.socket.timeout.ms = 30000
    replication.quota.window.num = 11
    replication.quota.window.size.seconds = 1
    request.timeout.ms = 30000
    reserved.broker.max.id = 1000
    sasl.enabled.mechanisms = [GSSAPI]
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.principal.to.local.rules = [DEFAULT]
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.mechanism.inter.broker.protocol = GSSAPI
    security.inter.broker.protocol = PLAINTEXT
    socket.receive.buffer.bytes = 102400
    socket.request.max.bytes = 104857600
    socket.send.buffer.bytes = 102400
    ssl.cipher.suites = null
    ssl.client.auth = none
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = null
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    transaction.abort.timed.out.transaction.cleanup.interval.ms = 60000
    transaction.max.timeout.ms = 900000
    transaction.remove.expired.transaction.cleanup.interval.ms = 3600000
    transaction.state.log.load.buffer.size = 5242880
    transaction.state.log.min.isr = 1
    transaction.state.log.num.partitions = 50
    transaction.state.log.replication.factor = 1
    transaction.state.log.segment.bytes = 104857600
    transactional.id.expiration.ms = 604800000
    unclean.leader.election.enable = false
    zookeeper.connect = localhost:2181
    zookeeper.connection.timeout.ms = 6000
    zookeeper.session.timeout.ms = 6000
    zookeeper.set.acl = false
    zookeeper.sync.time.ms = 2000
 (kafka.server.KafkaConfig)


... (output omitted; it is too long)


[2017-09-05 16:34:30,741] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions test-0 (kafka.server.ReplicaFetcherManager)
[2017-09-05 16:34:30,749] INFO Replica loaded for partition test-0 with initial high watermark 1 (kafka.cluster.Replica)
[2017-09-05 16:34:30,750] INFO Partition [test,0] on broker 0: test-0 starts at Leader Epoch 1 from offset 1. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
[2017-09-05 16:34:30,766] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions test-0 (kafka.server.ReplicaFetcherManager)
[2017-09-05 16:34:30,766] INFO Partition [test,0] on broker 0: test-0 starts at Leader Epoch 2 from offset 1. Previous Leader Epoch was: 1 (kafka.cluster.Partition)

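The server started successfully.

Running the producer

Likewise, find kafka-console-producer.sh in the bin directory and run it.

screenshot

Then produce some messages.

Running the consumer

Find kafka-console-consumer.sh in the bin directory and run it

with the program arguments "--bootstrap-server localhost:9092 --topic test --new-consumer --from-beginning".
It then consumes all the messages the producer just sent.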

Kafka command reference

A collection of commonly used Kafka commands.

Administration

## Create a topic (4 partitions, 2 replicas)
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic test

Queries

## Describe the topics in the cluster
bin/kafka-topics.sh --describe --zookeeper 

## List topics
bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --list

## List topics (kafka-topics.sh accepts --bootstrap-server from 2.2 onwards)
bin/kafka-topics.sh --list --bootstrap-server localhost:9092

## List consumer groups of the new consumer (0.9+)
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list

## List consumer groups of the new consumer (0.10+)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

## Show the consumption details of a consumer group (only for offsets stored in ZooKeeper)
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper localhost:2181 --group test

## Show the consumption details of a consumer group (0.9 up to, but not including, 0.10.1.0)
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group test-consumer-group

## Show the consumption details of a consumer group (0.10.1.0+)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
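The describe output lists, per partition, the group's committed offset (CURRENT-OFFSET), the partition's log-end offset (LOG-END-OFFSET), and their difference (LAG). The plain-Java sketch below reproduces that arithmetic from two offset maps. The class, method names, and sample offsets are hypothetical. It simply skips partitions without a committed offset, because their lag cannot be computed.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of the LAG column shown by kafka-consumer-groups.sh --describe. */
public class ConsumerLagSketch {

    /** Lag per partition = log-end offset minus committed offset, clamped at 0 as a guard. */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> committed, Map<Integer, Long> logEnd) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : logEnd.entrySet()) {
            Long current = committed.get(e.getKey());
            if (current == null) {
                continue; // nothing committed yet: lag is unknown, so leave the partition out
            }
            lag.put(e.getKey(), Math.max(0L, e.getValue() - current));
        }
        return lag;
    }

    /** Sum of the per-partition lags. */
    static long totalLag(Map<Integer, Long> lag) {
        long total = 0;
        for (long l : lag.values()) {
            total += l;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> committed = new TreeMap<>();
        committed.put(0, 90L);
        committed.put(1, 100L);
        Map<Integer, Long> logEnd = new TreeMap<>();
        logEnd.put(0, 100L);
        logEnd.put(1, 100L);
        logEnd.put(2, 5L); // partition 2 has no committed offset
        Map<Integer, Long> lag = lagPerPartition(committed, logEnd);
        System.out.println(lag + " total=" + totalLag(lag)); // {0=10, 1=0} total=10
    }
}
```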

Producing and consuming

## Producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

## Consumer
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test

## New producer (0.9+)
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test --producer.config config/producer.properties

## New consumer (0.9+)
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --new-consumer --from-beginning --consumer.config config/consumer.properties

## More advanced usage
bin/kafka-simple-consumer-shell.sh --broker-list localhost:9092 --topic test --partition 0 --offset 1234 --max-messages 10

Rebalancing partition leaders

bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot

Kafka's built-in performance test

bin/kafka-producer-perf-test.sh --topic test --num-records 100 --record-size 1 --throughput 100  --producer-props bootstrap.servers=localhost:9092

Increasing the number of partitions

bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic1 --partitions 2
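Adding partitions does not move existing records. It does change where new records with a given key land, because the default partitioner takes a non-negative hash of the key modulo the partition count. Kafka hashes key bytes with murmur2. To stay self-contained, the sketch below substitutes String.hashCode(), so its partition numbers will not match Kafka's; only the effect is illustrated. The class and methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of key-to-partition remapping after adding partitions. */
public class PartitionRemapSketch {

    /** Stand-in for the default partitioner: non-negative hash modulo partition count. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    /** Keys whose target partition changes when the count goes from 'before' to 'after'. */
    static List<String> movedKeys(List<String> keys, int before, int after) {
        List<String> moved = new ArrayList<>();
        for (String key : keys) {
            if (partitionFor(key, before) != partitionFor(key, after)) {
                moved.add(key);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");
        // Going from 1 to 2 partitions: keys that now hash to partition 1 move.
        System.out.println(movedKeys(keys, 1, 2)); // [a, c]
    }
}
```

If consumers rely on all records of a key living in one partition (for example, for per-key ordering), plan partition increases with this in mind.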

Reassigning partitions

Create the reassignment JSON file

cat > increase-replication-factor.json <<EOF
{"version":1, "partitions":[
{"topic":"__consumer_offsets","partition":0,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":1,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":2,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":3,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":4,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":5,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":6,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":7,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":8,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":9,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":10,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":11,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":12,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":13,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":14,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":15,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":16,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":17,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":18,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":19,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":20,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":21,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":22,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":23,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":24,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":25,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":26,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":27,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":28,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":29,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":30,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":31,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":32,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":33,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":34,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":35,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":36,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":37,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":38,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":39,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":40,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":41,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":42,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":43,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":44,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":45,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":46,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":47,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":48,"replicas":[0,1]},
{"topic":"__consumer_offsets","partition":49,"replicas":[0,1]}]
}
EOF
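Typing 50 nearly identical lines by hand is error-prone. The plain-Java sketch below generates a reassignment file in the same layout as the one above for any topic, partition count, and replica list. The class and method names are hypothetical. It assumes the topic name needs no JSON escaping, which holds for legal Kafka topic names (ASCII letters, digits, '.', '_' and '-').

```java
import java.util.StringJoiner;

/** Hypothetical generator for kafka-reassign-partitions.sh input files. */
public class ReassignmentJsonSketch {

    /** Builds {"version":1, "partitions":[...]} assigning every partition to the same replicas. */
    static String reassignmentJson(String topic, int partitionCount, int... replicas) {
        StringJoiner replicaList = new StringJoiner(",", "[", "]");
        for (int r : replicas) {
            replicaList.add(Integer.toString(r));
        }
        StringJoiner partitions = new StringJoiner(",\n", "{\"version\":1, \"partitions\":[\n", "]\n}");
        for (int p = 0; p < partitionCount; p++) {
            partitions.add("{\"topic\":\"" + topic + "\",\"partition\":" + p
                    + ",\"replicas\":" + replicaList + "}");
        }
        return partitions.toString();
    }

    public static void main(String[] args) {
        // Same content as increase-replication-factor.json above.
        System.out.println(reassignmentJson("__consumer_offsets", 50, 0, 1));
    }
}
```

Redirecting the program's output to increase-replication-factor.json yields a file equivalent to the heredoc above.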

Execute

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute

Verify

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --verify

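Getting the Kafka version

Many installations put Kafka under /usr/local/kafka, so the path does not reveal the version, and older Kafka releases offer no command like -version.

Run the following commands to get the Kafka version.

# Enter the Kafka directory
cd /usr/local/kafka

# List the main Kafka jar (excludes -sources, -javadoc, -test and similar jars)
ls ./libs/ | grep -E '^kafka_[0-9.]+-[0-9.]+\.jar$'
kafka_2.11-0.10.0.0.jar

This gives the version 2.11-0.10.0.0, i.e. Scala 2.11 and Kafka 0.10.0.0.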
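The jar name follows the pattern kafka_&lt;Scala version&gt;-&lt;Kafka version&gt;.jar, which is also how the download archives are named (kafka_2.11-1.1.0.tgz, for example). The plain-Java sketch below splits such a name into its two versions. The class and method are hypothetical. Auxiliary jars such as -sources or -test jars are deliberately rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for names like kafka_2.11-0.10.0.0.jar. */
public class KafkaVersionSketch {

    // kafka_<scala version>-<kafka version>, both made of dot-separated numbers
    private static final Pattern NAME = Pattern.compile("kafka_(\\d+(?:\\.\\d+)*)-(\\d+(?:\\.\\d+)*)");

    /** Returns {scalaVersion, kafkaVersion}, or null if the name is not a main Kafka jar/dir name. */
    static String[] parse(String fileName) {
        String base = fileName.endsWith(".jar")
                ? fileName.substring(0, fileName.length() - ".jar".length())
                : fileName;
        Matcher m = NAME.matcher(base);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] v = parse("kafka_2.11-0.10.0.0.jar");
        System.out.println("Scala " + v[0] + ", Kafka " + v[1]); // Scala 2.11, Kafka 0.10.0.0
    }
}
```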

Kafka with Kerberos in practice

Environment

Version: kafka_2.12-2.3.0
Hostname: orchome
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.5.1804 (Core)
Release: 7.5.1804
Codename: Core
Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Apr 20 16:44:24 UTC 2018

Generating the Kerberos principals

## Create the principals
sudo /usr/sbin/kadmin.local -q 'addprinc -randkey zookeeper/orchome@EXAMPLE.COM'
sudo /usr/sbin/kadmin.local -q 'addprinc -randkey kafka/orchome@EXAMPLE.COM'
sudo /usr/sbin/kadmin.local -q 'addprinc -randkey clients/orchome@EXAMPLE.COM'

sudo /usr/sbin/kadmin.local -q "ktadd -k /etc/security/keytabs/kafka_server.keytab kafka/orchome@EXAMPLE.COM"
sudo /usr/sbin/kadmin.local -q "ktadd -k /etc/security/keytabs/kafka_zookeeper.keytab zookeeper/orchome@EXAMPLE.COM"
sudo /usr/sbin/kadmin.local -q "ktadd -k /etc/security/keytabs/kafka_client.keytab clients/orchome@EXAMPLE.COM"

## Verify
klist -t -e -k /etc/security/keytabs/kafka_zookeeper.keytab
klist -t -e -k /etc/security/keytabs/kafka_server.keytab
klist -t -e -k /etc/security/keytabs/kafka_client.keytab
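Each principal has the form primary/instance@REALM, for example kafka/orchome@EXAMPLE.COM. The client setting sasl.kerberos.service.name (kafka in the configs below) must equal the primary part of the broker's principal. The plain-Java sketch below splits a principal into its three parts so the match can be checked. The class and method are hypothetical, and only string handling is shown, not Kerberos validation.

```java
/** Hypothetical splitter for Kerberos principal names of the form primary[/instance]@REALM. */
public class PrincipalSketch {

    /** Returns {primary, instance, realm}; instance is "" when the principal has none. */
    static String[] split(String principal) {
        int at = principal.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no realm in " + principal);
        }
        String name = principal.substring(0, at);
        String realm = principal.substring(at + 1);
        int slash = name.indexOf('/');
        String primary = slash < 0 ? name : name.substring(0, slash);
        String instance = slash < 0 ? "" : name.substring(slash + 1);
        return new String[] {primary, instance, realm};
    }

    public static void main(String[] args) {
        String[] broker = split("kafka/orchome@EXAMPLE.COM");
        String serviceName = "kafka"; // value of sasl.kerberos.service.name in the client config
        System.out.println(broker[0] + " | " + broker[1] + " | " + broker[2]); // kafka | orchome | EXAMPLE.COM
        System.out.println("service name matches: " + serviceName.equals(broker[0])); // true
    }
}
```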

Contents of each file

more /etc/krb5.conf

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 EXAMPLE.COM = {
  kdc = orchome
  admin_server = orchome
 }

[domain_realm]
kafka = EXAMPLE.COM
zookeeper = EXAMPLE.COM
clients = EXAMPLE.COM

kadmin.local

Authenticating as principal root/admin@EXAMPLE.COM with password.
kadmin.local:  listprincs 
K/M@EXAMPLE.COM
admin/admin@EXAMPLE.COM
clients/orchome@EXAMPLE.COM
kadmin/admin@EXAMPLE.COM
kadmin/changepw@EXAMPLE.COM
kadmin/orchome@EXAMPLE.COM
kafka/orchome@EXAMPLE.COM
krbtgt/EXAMPLE.COM@EXAMPLE.COM
krbtgt/orchome@EXAMPLE.COM
zookeeper/orchome@EXAMPLE.COM

klist -t -e -k /var/kerberos/krb5kdc/kafka.keytab

Keytab name: FILE:/var/kerberos/krb5kdc/kafka.keytab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (des3-cbc-sha1)
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (arcfour-hmac)
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (des-hmac-sha1)
   3 07/24/16 00:58:30 kafka/orchome@EXAMPLE.COM (des-cbc-md5)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (des3-cbc-sha1)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (arcfour-hmac)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (des-hmac-sha1)
   2 07/24/16 12:23:18 zookeeper/orchome@EXAMPLE.COM (des-cbc-md5)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (des3-cbc-sha1)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (arcfour-hmac)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (des-hmac-sha1)
   2 07/25/16 11:31:37 kafka/127.0.0.1@EXAMPLE.COM (des-cbc-md5)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (des3-cbc-sha1)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (arcfour-hmac)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (des-hmac-sha1)
   3 07/25/16 13:13:31 kafka/orchome@EXAMPLE.COM (des-cbc-md5)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (des3-cbc-sha1)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (arcfour-hmac)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (des-hmac-sha1)
   2 07/25/16 15:07:58 zookeeper/127.0.0.1@EXAMPLE.COM (des-cbc-md5)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (des3-cbc-sha1)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (arcfour-hmac)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (des-hmac-sha1)
   2 07/25/16 18:47:55 clients@EXAMPLE.COM (des-cbc-md5)

more /etc/kafka/zookeeper_jaas.conf

Server{
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    useTicketCache=false
    keyTab="/etc/security/keytabs/kafka_zookeeper.keytab"
    principal="zookeeper/orchome@EXAMPLE.COM";
};

more /etc/kafka/kafka_server_jaas.conf

KafkaServer {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   storeKey=true
   keyTab="/etc/security/keytabs/kafka_server.keytab"
   principal="kafka/orchome@EXAMPLE.COM";
};

// Zookeeper client authentication
Client {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   storeKey=true
   keyTab="/etc/security/keytabs/kafka_server.keytab"
   principal="kafka/orchome@EXAMPLE.COM";
};

more /etc/kafka/kafka_client_jaas.conf

KafkaClient {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   storeKey=true
   keyTab="/etc/security/keytabs/kafka_client.keytab"
   principal="clients/orchome@EXAMPLE.COM";
};

more config/server.properties

listeners=SASL_PLAINTEXT://orchome:9093
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=GSSAPI
sasl.enabled.mechanisms=GSSAPI
sasl.kerberos.service.name=kafka

more start-zk-and-kafka.sh

#!/bin/bash
export KAFKA_HEAP_OPTS='-Xmx256M'
export KAFKA_OPTS='-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/zookeeper_jaas.conf'
bin/zookeeper-server-start.sh config/zookeeper.properties &

sleep 5

export KAFKA_OPTS='-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf'
bin/kafka-server-start.sh config/server.properties

more config/zookeeper.properties

authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
jaasLoginRenew=3600000

more config/producer.properties   (config/consumer.properties has the same content)

security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
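A Java client can carry the same settings programmatically. Instead of pointing KAFKA_OPTS at kafka_client_jaas.conf, it can pass the KafkaClient login section inline through the sasl.jaas.config property (supported since Kafka 0.10.2). The sketch below only builds the java.util.Properties that would be handed to a KafkaProducer or KafkaConsumer. The helper name is hypothetical, and the Kerberos configuration file is still assumed to come from -Djava.security.krb5.conf or the default /etc/krb5.conf.

```java
import java.util.Properties;

/** Hypothetical helper that builds SASL/GSSAPI client settings matching the files above. */
public class KerberosClientConfigSketch {

    static Properties kerberosClientProps(String bootstrapServers, String keytab, String principal) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "GSSAPI");
        props.put("sasl.kerberos.service.name", "kafka"); // primary of the broker principal
        // Inline JAAS entry with the same content as KafkaClient in kafka_client_jaas.conf
        props.put("sasl.jaas.config",
                "com.sun.security.auth.module.Krb5LoginModule required"
                        + " useKeyTab=true storeKey=true"
                        + " keyTab=\"" + keytab + "\""
                        + " principal=\"" + principal + "\";");
        return props;
    }

    public static void main(String[] args) {
        Properties props = kerberosClientProps("orchome:9093",
                "/etc/security/keytabs/kafka_client.keytab", "clients/orchome@EXAMPLE.COM");
        System.out.println(props.getProperty("sasl.jaas.config"));
    }
}
```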

more producer2.sh

export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"

bin/kafka-console-producer.sh --broker-list orchome:9093 --topic test --producer.config config/producer.properties

more consumer2.sh

export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"

bin/kafka-console-consumer.sh --bootstrap-server orchome:9093 --topic test --from-beginning --consumer.config config/consumer.properties

This example is based on the document:

Kafka authentication with SASL/Kerberos

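Kafka Streams demo application

This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. If you have already started Kafka and ZooKeeper, skip the first two steps.

Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology, making these applications highly scalable, elastic, fault-tolerant, and distributed.

This quickstart example demonstrates how to run a streaming application. Here is the gist of the WordCountDemo example code (written with Java 8 lambda expressions for easier reading):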

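// Imports as in the Kafka Streams 1.1.x API used by this quickstart
// (newer releases use org.apache.kafka.streams.kstream.Consumed instead).
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Consumed;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

// Builder for the processing topology
final StreamsBuilder builder = new StreamsBuilder();

// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-plaintext-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream("streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> wordCounts = textLines
    // Split each text line into lower-case words on runs of non-word characters (\W+).
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))

    // Re-key each record by its word; the repartitioning relies on the default
    // key/value serdes being set to String in the StreamsConfig (not shown).
    .groupBy((key, value) -> value)

    // Count the occurrences of each word (message key).
    .count();

// Store the running counts as a changelog stream to the output topic.
wordCounts.toStream().to("streams-wordcount-output", Produced.with(stringSerde, longSerde));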

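It implements the WordCount algorithm, which computes a word occurrence histogram from the input text. Unlike other WordCount examples you may have seen that operate on bounded data, this demo application behaves slightly differently because it is designed to operate on an infinite, unbounded stream of data. Like the bounded variant, it is a stateful algorithm that tracks and updates the counts of words. However, since it must assume potentially unbounded input data, it periodically outputs its current state and results while it continues to process more data, because it cannot know when it has processed "all" of the input.

As a first step, we start Kafka, then prepare input data in a Kafka topic, which is subsequently processed by the Kafka Streams application.

Step 1: Download the code

Download the 1.1.0 release and un-tar it. Note that there are multiple downloadable Scala versions; here we use the recommended version (2.11):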

> tar -xzf kafka_2.11-1.1.0.tgz
> cd kafka_2.11-1.1.0

Step 2: Start the Kafka server

Kafka uses ZooKeeper, so first start a ZooKeeper server.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...

Step 3: Prepare the input topic and start the Kafka producer

Next, we create the input topic streams-plaintext-input and the output topic streams-wordcount-output:

> bin/kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 1 \
    --partitions 1 \
    --topic streams-plaintext-input
Created topic "streams-plaintext-input".

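Note: we create the output topic with log compaction enabled because the output stream is a changelog stream (cf. the explanation of the application output below).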

> bin/kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 1 \
    --partitions 1 \
    --topic streams-wordcount-output \
    --config cleanup.policy=compact
Created topic "streams-wordcount-output".
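With cleanup.policy=compact, Kafka eventually retains at least the latest record for every key instead of deleting old segments by age. That fits a changelog, where only the newest count per word matters. The plain-Java sketch below models the end state of compaction on a small log. The class and record type are hypothetical. The real log cleaner runs in the background on closed segments and never compacts the active segment, so consumers may still see superseded values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of what log compaction keeps; not broker code. */
public class CompactionSketch {

    static final class Rec {
        final long offset;
        final String key;
        final long value;

        Rec(long offset, String key, long value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return offset + ":" + key + "=" + value;
        }
    }

    /** Keeps only the last record per key; survivors keep their original offsets and order. */
    static List<Rec> compact(List<Rec> log) {
        Map<String, Long> lastOffset = new HashMap<>();
        for (Rec r : log) {
            lastOffset.put(r.key, r.offset); // later offsets overwrite earlier ones
        }
        List<Rec> kept = new ArrayList<>();
        for (Rec r : log) {
            if (lastOffset.get(r.key).longValue() == r.offset) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] keys = {"all", "streams", "lead", "to", "kafka", "hello", "kafka", "streams"};
        long[] values = {1, 1, 1, 1, 1, 1, 2, 2};
        List<Rec> log = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            log.add(new Rec(i, keys[i], values[i]));
        }
        System.out.println(compact(log));
        // [0:all=1, 2:lead=1, 3:to=1, 5:hello=1, 6:kafka=2, 7:streams=2]
    }
}
```

Note that offsets are never renumbered. Compaction only leaves gaps where superseded records used to be.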

The created topics can be described with the same kafka-topics tool:

> bin/kafka-topics.sh --zookeeper localhost:2181 --describe

Topic:streams-plaintext-input   PartitionCount:1    ReplicationFactor:1 Configs:
    Topic: streams-plaintext-input  Partition: 0    Leader: 0   Replicas: 0 Isr: 0
Topic:streams-wordcount-output  PartitionCount:1    ReplicationFactor:1 Configs:
    Topic: streams-wordcount-output Partition: 0    Leader: 0   Replicas: 0 Isr: 0

Step 4: Start the WordCount application

The following command starts the WordCount demo application:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

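The demo application reads from the input topic streams-plaintext-input, performs the WordCount computation on each message it reads, and continuously writes its current results to the output topic streams-wordcount-output. Hence there is no STDOUT output apart from log entries, because the results are written back into Kafka.

Now start the console producer in a separate terminal to write some input data to this topic:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input

Then, in another terminal, inspect the output of the WordCount application by reading its output topic with the console consumer: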

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output \
    --from-beginning \
    --formatter kafka.tools.DefaultMessageFormatter \
    --property print.key=true \
    --property print.value=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

Step 5: 处理数据

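Now write some messages into the input topic streams-plaintext-input with the console producer by entering a single line of text and then hitting <RETURN>. Each line becomes a new message whose key is null and whose value is the string-encoded text line you just entered (in practice, application input data typically streams into Kafka continuously rather than being typed in manually as in this quickstart):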

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input
all streams lead to kafka

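These messages are processed by the WordCount application, which writes the following output data to the streams-wordcount-output topic; the console consumer in the other terminal prints it: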

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output \
    --from-beginning \
    --formatter kafka.tools.DefaultMessageFormatter \
    --property print.key=true \
    --property print.value=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

all     1
streams 1
lead    1
to      1
kafka   1

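Here, the first column is the Kafka message key in java.lang.String format and represents the word being counted; the second column is the message value in java.lang.Long format and represents the word's latest count.

Now write one more message into the input topic streams-plaintext-input with the console producer: enter the text line "hello kafka streams" and hit <RETURN>. Your terminal should look like this: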
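The consumer needs the matching deserializers because Kafka stores only bytes. As far as we know, Kafka's LongSerializer writes a long as 8 big-endian bytes and StringSerializer uses UTF-8 by default. If the value were read as a String instead, those 8 bytes would print as unreadable characters. The plain-Java sketch below reproduces both encodings with java.nio rather than the Kafka classes, and the helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical re-implementation of the String/Long wire formats used by the output topic. */
public class SerdeSketch {

    static byte[] serializeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array(); // ByteBuffer is big-endian by default
    }

    static long deserializeLong(byte[] bytes) {
        if (bytes.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + bytes.length);
        }
        return ByteBuffer.wrap(bytes).getLong();
    }

    static byte[] serializeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String deserializeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(serializeLong(2L)));      // [0, 0, 0, 0, 0, 0, 0, 2]
        System.out.println(deserializeLong(serializeLong(2L)));      // 2
        System.out.println(deserializeString(serializeString("kafka"))); // kafka
    }
}
```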

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input
all streams lead to kafka
hello kafka streams

In the terminal running the console consumer, you can observe the new data the WordCount application wrote to the output topic:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output \
    --from-beginning \
    --formatter kafka.tools.DefaultMessageFormatter \
    --property print.key=true \
    --property print.value=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

all     1
streams 1
lead    1
to      1
kafka   1
hello   1
kafka   2
streams 2

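Here the last printed lines, kafka 2 and streams 2, indicate that the counts for the keys kafka and streams were incremented from 1 to 2. Whenever you write more input messages to the input topic, new messages are added to the streams-wordcount-output topic, representing the latest word counts computed by the WordCount application. Let's enter one final input text line, "join kafka summit", and hit <RETURN> in the console producer for the input topic streams-plaintext-input before we wrap up this quickstart:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input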
all streams lead to kafka
hello kafka streams
join kafka summit

The streams-wordcount-output topic will then show the corresponding updated word counts (see the last three lines):

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output \
    --from-beginning \
    --formatter kafka.tools.DefaultMessageFormatter \
    --property print.key=true \
    --property print.value=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

all     1
streams 1
lead    1
to      1
kafka   1
hello   1
kafka   2
streams 2
join    1
kafka   3
summit  1

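As you can see, the output of the WordCount application is actually a continuous stream of updates, where each output record (i.e. each line in the original output above) is an updated count of a single word, that is, of a record key such as "kafka". For multiple records with the same key, each later record is an update of the previous one.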
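To make the "stream of updates" concrete, here is a minimal plain-Java sketch that models what WordCount emits, without any Kafka dependency: a LinkedHashMap stands in for the KTable and a List stands in for the output topic. The class name WordCountChangelog is hypothetical. The sketch emits one change record per word occurrence; the real application may combine consecutive updates of the same key in its record cache, so the console output you see can contain fewer lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy model of the WordCount update stream (no Kafka involved).
final class WordCountChangelog {

    // Processes input lines in order and returns each emitted change record as "word count".
    static List<String> process(List<String> lines) {
        Map<String, Long> table = new LinkedHashMap<>(); // stands in for the KTable / state store
        List<String> changelog = new ArrayList<>();      // stands in for streams-wordcount-output
        for (String line : lines) {
            // Same tokenization as the WordCount example: lower-case, then split on non-word characters.
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                long updated = table.merge(word, 1L, Long::sum); // read current count, add 1, write back
                changelog.add(word + " " + updated);             // forward the updated count downstream
            }
        }
        return changelog;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "all streams lead to kafka",
                "hello kafka streams",
                "join kafka summit");
        for (String record : process(input)) {
            System.out.println(record); // all 1, streams 1, lead 1, ..., kafka 3, summit 1
        }
    }
}
```

Each line of the printed result corresponds to one line of the console consumer output shown above.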

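The two diagrams below illustrate what is happening behind the scenes. The first column shows the evolution of the current state of the KTable<String, Long> that counts word occurrences. The second column shows the change records that result from state updates to the KTable and that are sent to the output topic streams-wordcount-output.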

[Figure: KTable state (first column) and the change records sent to streams-wordcount-output (second column)]

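First the text line "all streams lead to kafka" is processed. The KTable is being built up as each new word results in a new table entry (highlighted with a green background), and a corresponding change record is sent to the downstream KStream.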

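When the second text line "hello kafka streams" is processed, we observe for the first time that existing entries in the KTable are being updated (here: for the words "kafka" and "streams"). And again, change records are sent to the output topic.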

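(We skip the illustration of how the third line is processed.) This explains why the output topic has the contents shown above: it contains the full record of changes.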

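Looking beyond the scope of this concrete example, what Kafka Streams does here is leverage the duality between a table and a changelog stream (here: table = the KTable, changelog stream = the downstream KStream): you can publish every change of the table to a stream, and if you consume the entire changelog stream from beginning to end, you can reconstruct the contents of the table.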
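The second half of that duality, rebuilding the table by replaying the changelog from beginning to end, can be sketched in plain Java. This is a hypothetical illustration rather than Kafka Streams code: the class ChangelogReplay and its Change type are made up for this example, and "the last record per key wins" mirrors the update semantics described above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: reconstruct a table from its changelog stream.
final class ChangelogReplay {

    // One change record: a key and the new value of that key in the table.
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Applies the changelog in order; a later record for a key overwrites the earlier one.
    static Map<String, Long> rebuild(List<Change> changelog) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Change change : changelog) {
            table.put(change.key, change.value);
        }
        return table;
    }

    public static void main(String[] args) {
        // The change records from the streams-wordcount-output example above.
        List<Change> changelog = Arrays.asList(
                new Change("all", 1), new Change("streams", 1), new Change("lead", 1),
                new Change("to", 1), new Change("kafka", 1), new Change("hello", 1),
                new Change("kafka", 2), new Change("streams", 2), new Change("join", 1),
                new Change("kafka", 3), new Change("summit", 1));
        // Prints {all=1, streams=2, lead=1, to=1, kafka=3, hello=1, join=1, summit=1}
        System.out.println(rebuild(changelog));
    }
}
```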

Step 6: Tear down the application

Finally, stop the console consumer, the console producer, the WordCount application, the Kafka broker and the ZooKeeper server via Ctrl-C.

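A step-by-step guide to writing a Kafka Streams application

In this guide we will start from scratch and help you set up your own Kafka Streams application. If you have not done so yet, it is highly recommended that you first read the quickstart on how to run a Streams application written with Kafka Streams.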

Setting up a Maven project

We are going to use the Kafka Streams Maven Archetype to create the Streams project structure with the following command:

mvn archetype:generate \
    -DarchetypeGroupId=org.apache.kafka \
    -DarchetypeArtifactId=streams-quickstart-java \
    -DarchetypeVersion=1.1.0 \
    -DgroupId=streams.examples \
    -DartifactId=streams.examples \
    -Dversion=0.1 \
    -Dpackage=myapps

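You can use different values for the groupId, artifactId and package parameters if you like. Assuming the above parameter values are used, this command will create a project structure that looks like this: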

> tree streams.examples
streams.examples
|-- pom.xml
|-- src
    |-- main
        |-- java
        |   |-- myapps
        |       |-- LineSplit.java
        |       |-- Pipe.java
        |       |-- WordCount.java
        |-- resources
            |-- log4j.properties

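The pom.xml file included in the project already defines the Streams dependency, and there are already several example programs written with the Streams library under src/main/java. Since we are going to write such programs from scratch, let's delete these examples now: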

> cd streams.examples
> rm src/main/java/myapps/*.java

Writing a first Streams application: Pipe

It's coding time now! Feel free to open your favorite IDE and import this Maven project, or simply open a text editor and create a java file under src/main/java. Let's name it Pipe.java:

package myapps;

public class Pipe {

    public static void main(String[] args) throws Exception {

    }
}

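We are going to fill in the main function to write this pipe program. Note that we will not list the import statements as we go, since IDEs can usually add them automatically. However, if you are using a text editor you need to add the imports manually, and at the end of this section we will show the complete code snippet with import statements.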

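The first step in writing a Streams application is to create a java.util.Properties map to specify the different Streams execution configuration values defined in StreamsConfig. A couple of important configuration values you need to set are StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, which specifies the list of host/port pairs used to establish the initial connection to the Kafka cluster, and StreamsConfig.APPLICATION_ID_CONFIG, which gives the unique identifier of your Streams application to distinguish it from other applications talking to the same Kafka cluster: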

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assuming that the Kafka broker this application is talking to runs on the local machine with port 9092

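In addition, you can customize other configurations in the same map, for example the default serialization and deserialization libraries (serdes) for the record key-value pairs: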

props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

For a full list of Kafka Streams configurations, please refer to the Kafka Streams configuration documentation.

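Next we will define the computational logic of our Streams application. In Kafka Streams this computational logic is defined as a topology of connected processor nodes. We can use a topology builder, StreamsBuilder, to construct such a topology: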

final StreamsBuilder builder = new StreamsBuilder();

Then use this builder to create a source stream (i.e. where the data comes from) from the Kafka topic named streams-plaintext-input:

KStream<String, String> source = builder.stream("streams-plaintext-input");

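Now we have a KStream that continuously generates records from its source Kafka topic streams-plaintext-input. The records are organized as key-value pairs of type String. The simplest thing we can do with this stream is to write it into another Kafka topic, say streams-pipe-output: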

source.to("streams-pipe-output");

Note that we can also concatenate the above two lines into a single line:

builder.stream("streams-plaintext-input").to("streams-pipe-output");

We can inspect what kind of topology is created from this builder by doing the following:

final Topology topology = builder.build();

And print its description to standard output:

System.out.println(topology.describe());

If we compile and run the program now, it will output the following information:

> mvn clean package
> mvn exec:java -Dexec.mainClass=myapps.Pipe
Sub-topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000(topics: streams-plaintext-input) --> KSTREAM-SINK-0000000001
    Sink: KSTREAM-SINK-0000000001(topic: streams-pipe-output) <-- KSTREAM-SOURCE-0000000000
Global Stores:
  none

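As shown above, the built topology has two processor nodes: a source node KSTREAM-SOURCE-0000000000 and a sink node KSTREAM-SINK-0000000001. KSTREAM-SOURCE-0000000000 continuously reads records from the Kafka topic streams-plaintext-input and pipes them to its downstream node KSTREAM-SINK-0000000001; KSTREAM-SINK-0000000001 writes each record it receives, in order, to another Kafka topic, streams-pipe-output (the --> and <-- arrows indicate the downstream and upstream processor nodes of each node, i.e. its "children" and "parents" within the topology graph). It also shows that this simple topology has no global state stores associated with it (we will talk more about state stores in the following sections).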

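Note that we can always describe the topology as we did above at any given point while we are building it in code, so as a user you can interactively "try and taste" the computational logic defined in the topology until you are happy with it. Suppose we are already done with this simple topology, which just pipes data from one Kafka topic to another in an endless streaming manner. We can now construct the Streams client with the two components we have just built: the configuration map and the topology object. (One can also construct a StreamsConfig object from the props map and pass that object to the constructor instead; KafkaStreams has overloaded constructors that accept either type.)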

final KafkaStreams streams = new KafkaStreams(topology, props);

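We can now trigger this client's execution by calling its start() function. The execution won't stop until close() is called on this client. We can, for example, add a shutdown hook with a countdown latch to capture a user interrupt and close the client when the program terminates: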

final CountDownLatch latch = new CountDownLatch(1);

// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
    @Override
    public void run() {
        streams.close();
        latch.countDown();
    }
});

try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
System.exit(0);


The complete code so far looks like this:

package myapps;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

import java.util.Properties;
import java.util.concurrent.CountDownLatch;

public class Pipe {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final StreamsBuilder builder = new StreamsBuilder();

        builder.stream("streams-plaintext-input").to("streams-pipe-output");

        final Topology topology = builder.build();

        final KafkaStreams streams = new KafkaStreams(topology, props);
        final CountDownLatch latch = new CountDownLatch(1);

        // attach shutdown handler to catch control-c
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Throwable e) {
            System.exit(1);
        }
        System.exit(0);
    }
}


If you already have Kafka running on localhost:9092 and have created the topics streams-plaintext-input and streams-pipe-output, you can run this code in your IDE or on the command line with Maven:

> mvn clean package
> mvn exec:java -Dexec.mainClass=myapps.Pipe

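For detailed instructions on how to run a Streams application and observe its computation results, please read the Play with a Streams Application section. We will not cover this in the rest of this section.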

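Writing a second Streams application: Line Split

We have learned how to construct a Streams client with its two key components: the StreamsConfig and the Topology. Now let's move on and add some real processing logic by augmenting the current topology. We can first create another program by copying the existing Pipe.java class: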

> cp src/main/java/myapps/Pipe.java src/main/java/myapps/LineSplit.java

And change its class name as well as the application id config to distinguish it from the original program:

public class LineSplit {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-linesplit");
        // ...
    }
}

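Since each record of the source stream is a key-value pair of String type, let's treat the value string as a text line and split it into words with a flatMapValues operator: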

KStream<String, String> source = builder.stream("streams-plaintext-input");
KStream<String, String> words = source.flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                return Arrays.asList(value.split("\\W+"));
            }
        });

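The operator takes the source stream as its input and generates a new stream named words: it processes each record of the source stream in order, breaks its value string into a list of words, and produces each word as a new record of the output words stream. This is a stateless operator that does not need to keep track of any previously received records or processed results. Note that if you are using JDK 8 you can use a lambda expression and simplify the above code as: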
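The value mapper itself is ordinary Java, so its behavior can be checked without Kafka. The sketch below (LineSplitDemo is a hypothetical class name) applies the same split("\\W+") call; flatMapValues keeps the record key unchanged and emits one record per element of the returned list. One detail worth knowing: a value that starts with a non-word character yields an empty first token, which would flow downstream as an empty "word".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of the mapping applied by flatMapValues in LineSplit.
final class LineSplitDemo {

    // Same logic as the lambda: split the value on runs of non-word characters.
    static List<String> splitLine(String value) {
        return Arrays.asList(value.split("\\W+"));
    }

    public static void main(String[] args) {
        System.out.println(splitLine("all streams lead to kafka")); // [all, streams, lead, to, kafka]
        System.out.println(splitLine("  hello, kafka!"));           // [, hello, kafka] (leading empty token)
    }
}
```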

KStream<String, String> source = builder.stream("streams-plaintext-input");
KStream<String, String> words = source.flatMapValues(value -> Arrays.asList(value.split("\\W+")));

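Finally, we can write the words stream back into another Kafka topic, say streams-linesplit-output. Again, these two steps can be concatenated as follows (assuming a lambda expression is used):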

KStream<String, String> source = builder.stream("streams-plaintext-input");
source.flatMapValues(value -> Arrays.asList(value.split("\\W+")))
      .to("streams-linesplit-output");

If we now print the description of this augmented topology with System.out.println(topology.describe()), we get the following:

> mvn clean package
> mvn exec:java -Dexec.mainClass=myapps.LineSplit
Sub-topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000(topics: streams-plaintext-input) --> KSTREAM-FLATMAPVALUES-0000000001
    Processor: KSTREAM-FLATMAPVALUES-0000000001(stores: []) --> KSTREAM-SINK-0000000002 <-- KSTREAM-SOURCE-0000000000
    Sink: KSTREAM-SINK-0000000002(topic: streams-linesplit-output) <-- KSTREAM-FLATMAPVALUES-0000000001
  Global Stores:
    none

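As we can see above, a new processor node KSTREAM-FLATMAPVALUES-0000000001 is injected into the topology between the original source and sink nodes. It takes the source node as its parent and the sink node as its child. In other words, each record fetched by the source node first traverses the newly added KSTREAM-FLATMAPVALUES-0000000001 node to be processed, which generates one or more new records. These continue down to the sink node to be written back to Kafka. Note that this processor node is "stateless", as it is not associated with any state stores (i.e. (stores: [])).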

The complete code looks like this (assuming a lambda expression is used):

package myapps;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Arrays;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;

public class LineSplit {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-linesplit");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> source = builder.stream("streams-plaintext-input");
        source.flatMapValues(value -> Arrays.asList(value.split("\\W+")))
              .to("streams-linesplit-output");

        final Topology topology = builder.build();
        final KafkaStreams streams = new KafkaStreams(topology, props);
        final CountDownLatch latch = new CountDownLatch(1);

        // ... same as Pipe.java above
    }
}


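Writing a third Streams application: Wordcount

Let's now take a step further and add some "stateful" computation to the topology by counting the occurrences of the words split from the source text stream. Following similar steps, we create another program based on the LineSplit.java class: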

public class WordCount {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
        // ...
    }
}

In order to count the words, we can first modify the flatMapValues operator to treat all of them as lower case:

source.flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                return Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+"));
            }
        });

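To do the counting aggregation, we first have to specify that we want to key the stream on the value string, i.e. the lower-cased word, with a groupBy operator. This operator generates a new grouped stream, which can then be aggregated by a count operator that generates a running count for each of the grouped keys: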

KTable<String, Long> counts =
source.flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                return Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+"));
            }
        })
      .groupBy(new KeyValueMapper<String, String, String>() {
           @Override
           public String apply(String key, String value) {
               return value;
           }
        })
      // Materialize the result into a KeyValueStore named "counts-store".
      // The Materialized store is always of type <Bytes, byte[]> as this is the format of the inner most store.
      .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>> as("counts-store"));

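Note that the count operator takes a Materialized parameter that specifies that the running count should be stored in a state store named counts-store. This counts-store store can be queried in real time; see the Developer Manual for details.

Note also that in order to read the changelog stream from the topic streams-wordcount-output, one needs to set the value deserializer to org.apache.kafka.common.serialization.LongDeserializer. Assuming JDK 8 lambda expressions can be used, the above code can be simplified as: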

KStream<String, String> source = builder.stream("streams-plaintext-input");
source.flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
      .groupBy((key, value) -> value)
      .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"))
      .toStream()
      .to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));

If we again print the description of this augmented topology with System.out.println(topology.describe()), we get the following:

> mvn clean package
> mvn exec:java -Dexec.mainClass=myapps.WordCount
Sub-topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000(topics: streams-plaintext-input) --> KSTREAM-FLATMAPVALUES-0000000001
    Processor: KSTREAM-FLATMAPVALUES-0000000001(stores: []) --> KSTREAM-KEY-SELECT-0000000002 <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-KEY-SELECT-0000000002(stores: []) --> KSTREAM-FILTER-0000000005 <-- KSTREAM-FLATMAPVALUES-0000000001
    Processor: KSTREAM-FILTER-0000000005(stores: []) --> KSTREAM-SINK-0000000004 <-- KSTREAM-KEY-SELECT-0000000002
    Sink: KSTREAM-SINK-0000000004(topic: Counts-repartition) <-- KSTREAM-FILTER-0000000005
  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000006(topics: Counts-repartition) --> KSTREAM-AGGREGATE-0000000003
    Processor: KSTREAM-AGGREGATE-0000000003(stores: [Counts]) --> KTABLE-TOSTREAM-0000000007 <-- KSTREAM-SOURCE-0000000006
    Processor: KTABLE-TOSTREAM-0000000007(stores: []) --> KSTREAM-SINK-0000000008 <-- KSTREAM-AGGREGATE-0000000003
    Sink: KSTREAM-SINK-0000000008(topic: streams-wordcount-output) <-- KTABLE-TOSTREAM-0000000007
Global Stores:
  none

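As we can see above, the topology now contains two disconnected sub-topologies. The first sub-topology's sink node KSTREAM-SINK-0000000004 writes to a repartition topic Counts-repartition, which is read by the second sub-topology's source node KSTREAM-SOURCE-0000000006. The repartition topic is used to "shuffle" the source stream by its aggregation key, which in this case is the value string. In addition, inside the first sub-topology, a stateless KSTREAM-FILTER-0000000005 node is injected between the grouping KSTREAM-KEY-SELECT-0000000002 node and the sink node to filter out any intermediate record whose aggregation key is null.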
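Why the shuffle is needed can be seen with a small sketch: after groupBy, the new key is the word, and every record with the same word must land in the same partition so that a single task owns that word's count. The code below is a hypothetical model, not Kafka's implementation; it assumes String.hashCode as a stand-in for Kafka's default partitioner, which actually hashes the serialized key bytes with murmur2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the repartition "shuffle" (not Kafka's partitioner).
final class RepartitionSketch {

    // Assumption: String.hashCode replaces Kafka's murmur2 hash of the serialized key.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions; // mask the sign bit to keep the index non-negative
    }

    // Routes each word (the new key after groupBy) to a partition of the repartition topic.
    static Map<Integer, List<String>> shuffle(List<String> words, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String word : words) {
            byPartition.computeIfAbsent(partitionFor(word, numPartitions), p -> new ArrayList<>()).add(word);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("all", "streams", "lead", "to", "kafka", "hello", "kafka", "streams");
        // Every occurrence of a given word ends up in the same partition's list.
        System.out.println(shuffle(words, 3));
    }
}
```

Because the mapping depends only on the key, the count for "kafka" can be kept in one state store partition and never needs to be merged across tasks.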

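In the second sub-topology, the aggregation node KSTREAM-AGGREGATE-0000000003 is associated with a state store named Counts. The store name is the one the user passes to the count operator; the printout above shows Counts, whereas the code in this section passes counts-store, in which case the store and the repartition topic are named after counts-store. Upon receiving each record from its upstream source node, the aggregation processor first queries its associated store to get the current count for that key, increments it by one, and writes the new count back to the store. Each updated count for the key is also piped downstream to the KTABLE-TOSTREAM-0000000007 node, which interprets this update stream as a record stream before piping it on to the sink node KSTREAM-SINK-0000000008 for writing back to Kafka.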

The complete code looks like this (assuming a lambda expression is used):

package myapps;

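import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;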

import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> source = builder.stream("streams-plaintext-input");
        source.flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
              .groupBy((key, value) -> value)
              .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"))
              .toStream()
              .to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));

        final Topology topology = builder.build();
        final KafkaStreams streams = new KafkaStreams(topology, props);
        final CountDownLatch latch = new CountDownLatch(1);

        // ... same as Pipe.java above
    }
}


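Accessing Kafka from an external network (forwarding)

Many people get stuck configuring Kafka for access from an external network or through forwarding, so here I will explain the principle and the cause in one place.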

Scenario

Suppose you have a Kafka cluster on Alibaba Cloud with 2 brokers, A and B.

Kafka cluster:

A internal: 172.10.0.1 external: 10.0.21.1
B internal: 172.10.0.2 external: 10.0.21.2

server.properties configuration

config/server-1.properties: 
    broker.id=1 
    listeners=PLAINTEXT://172.10.0.1:9092

config/server-2.properties: 
    broker.id=2 
    listeners=PLAINTEXT://172.10.0.2:9092

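With only the internal addresses configured, the brokers look reachable from the external network: 10.0.21.1:9092 and 10.0.21.2:9092 accept connections, but a Kafka client connecting through them reports timeouts.

Suppose you want to access the Kafka cluster from your own computer, i.e. via 10.0.21.1:9092 and 10.0.21.2:9092.

Note that I add one more layer to the scenario, forwarding. It is a bit more complicated, but the principle is the same.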

Routing/forwarding, for example:

11.10.21.1  -> 10.0.21.1
11.10.21.2  -> 10.0.21.2

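11.10.21.x are the IPs of the additional forwarding layer.

Testing

At this point, 11.10.21.1:9092 and 11.10.21.2:9092 are both reachable, yet sending or consuming messages through Kafka reports network timeouts. Why is that?

Because the Kafka client discovers the cluster's broker addresses on its own. When you connect via 11.10.21.1:9092 you really do reach the Kafka cluster, but the broker address list the cluster returns to you is what you configured in listeners, namely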

172.10.0.1:9092
172.10.0.2:9092

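so your IP forwarding and port mapping are of no use. This is the root cause behind all of these external-network and forwarding problems.

Solution

The simplest approach is host name mapping on the client side.

Change server.properties on the Kafka brokers: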

config/server-1.properties: 
    broker.id=1 
    listeners=PLAINTEXT://kafka-1:9092

config/server-2.properties: 
    broker.id=2 
    listeners=PLAINTEXT://kafka-2:9092

On the Kafka broker hosts, configure /etc/hosts:

cat /etc/hosts
172.10.0.1 kafka-1
172.10.0.2 kafka-2

On your local client machine, configure /etc/hosts:

cat /etc/hosts
11.10.21.1 kafka-1
11.10.21.2 kafka-2

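When the client accesses the Kafka cluster, it receives kafka-1:9092 and kafka-2:9092, and the client's hosts mappings translate these names into the corresponding external IPs, so the brokers become reachable.

Note: the ports must be the same on both sides, because the hosts mapping only translates the IP, not the port.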
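The mechanism can be modeled in a few lines of plain Java. This is a toy model, not Kafka client code (AdvertisedAddressModel and addressesToDial are hypothetical names): after bootstrapping, the client dials the host:port pairs from the brokers' listeners, resolving host names through its own hosts file while keeping the port unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of which addresses a client ends up dialing (not Kafka client code).
final class AdvertisedAddressModel {

    // Maps each advertised host:port through the client's hosts file; the port is never changed.
    static List<String> addressesToDial(List<String> advertised, Map<String, String> clientHosts) {
        List<String> dial = new ArrayList<>();
        for (String hostPort : advertised) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            String port = hostPort.substring(colon + 1);
            // An IP literal or an unknown name is used as-is; a name in the hosts file is translated.
            dial.add(clientHosts.getOrDefault(host, host) + ":" + port);
        }
        return dial;
    }

    public static void main(String[] args) {
        Map<String, String> clientHosts = new HashMap<>();
        clientHosts.put("kafka-1", "11.10.21.1");
        clientHosts.put("kafka-2", "11.10.21.2");

        // Before: listeners carry internal IPs, which the client cannot reach.
        System.out.println(addressesToDial(Arrays.asList("172.10.0.1:9092", "172.10.0.2:9092"), clientHosts));
        // After: listeners carry host names, which the client maps to the forwarding IPs.
        System.out.println(addressesToDial(Arrays.asList("kafka-1:9092", "kafka-2:9092"), clientHosts));
    }
}
```

The first line prints [172.10.0.1:9092, 172.10.0.2:9092] and the second prints [11.10.21.1:9092, 11.10.21.2:9092], which is exactly the difference between the failing and the working configuration.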

Kafka in practice: SSL

Generate the CA and the truststores

#!/bin/bash
# Step 1: create the broker keystore with a new key pair (alias localhost)
keytool -keystore server.keystore.jks -alias localhost -validity 365 -genkey
# Step 2: create a CA, then import its certificate into the server and client truststores
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365
keytool -keystore server.truststore.jks -alias CARoot -import -file ca-cert
keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert
# Step 3: export a certificate signing request, sign it with the CA,
# then import the CA certificate and the signed certificate into the broker keystore
keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial -passin pass:test1234
keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed

more /etc/kafka/kafka_server_jaas.conf

KafkaServer { 
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="admin"
    password="admin-secret"
    user_admin="admin-secret"
    user_alice="alice-secret";
};

more config/server.properties

listeners=SSL://localhost:9093
ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234
security.inter.broker.protocol=SSL

Start Kafka

export KAFKA_OPTS='-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf'
bin/kafka-server-start.sh config/server.properties

more client-ssl.properties

security.protocol=SSL
ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=test1234

Producer and consumer

bin/kafka-console-producer.sh --broker-list localhost:9093 --topic test --producer.config client-ssl.properties 

bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --consumer.config client-ssl.properties

This example is based on the article

Kafka SSL encryption and authentication

Kafka in practice: SASL/PLAIN authentication

more config/server.properties

ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234

listeners=SASL_SSL://localhost:9093
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN

more /etc/kafka/kafka_server_jaas.conf

KafkaServer { 
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="admin"
    password="admin-secret"
    user_admin="admin-secret"
    user_alice="alice-secret";
};

more /etc/kafka/kafka_client_jaas.conf

KafkaClient {
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="alice"
    password="alice-secret";
};

consumer.properties and producer.properties

security.protocol=SASL_SSL
sasl.mechanism=PLAIN

ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=test1234
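For Java clients there is an alternative to exporting KAFKA_OPTS: since Kafka 0.10.2 the JAAS login line can be passed in the client property sasl.jaas.config. The sketch below builds such client properties; the helper name saslSslProps is hypothetical, and the truststore path and password are the example values used in this article. The same helper would also fit the SCRAM setup below, with SCRAM-SHA-256 and org.apache.kafka.common.security.scram.ScramLoginModule as arguments.

```java
import java.util.Properties;

// Hypothetical helper: SASL_SSL client settings as java.util.Properties.
final class SaslClientProps {

    static Properties saslSslProps(String bootstrapServers, String mechanism, String loginModule,
                                   String username, String password) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", mechanism);
        // Same content as the KafkaClient section of kafka_client_jaas.conf, on one line ending with ';'.
        // Note: quotes inside username/password would need escaping; this sketch does not handle that.
        props.setProperty("sasl.jaas.config", loginModule + " required username=\"" + username
                + "\" password=\"" + password + "\";");
        props.setProperty("ssl.truststore.location", "/var/private/ssl/client.truststore.jks");
        props.setProperty("ssl.truststore.password", "test1234");
        return props;
    }

    public static void main(String[] args) {
        Properties plain = saslSslProps("localhost:9093", "PLAIN",
                "org.apache.kafka.common.security.plain.PlainLoginModule", "alice", "alice-secret");
        // Prints: org.apache.kafka.common.security.plain.PlainLoginModule required username="alice" password="alice-secret";
        System.out.println(plain.getProperty("sasl.jaas.config"));
    }
}
```

These properties, plus key and value serializers or deserializers, would then be passed to a KafkaProducer or KafkaConsumer constructor.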

Start Kafka

export KAFKA_OPTS='-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf'
bin/kafka-server-start.sh config/server.properties

Kafka producer and consumer

export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"
bin/kafka-console-producer.sh --broker-list localhost:9093 --topic test --producer.config config/producer.properties 

export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --consumer.config config/consumer.properties

This example is based on the article

Kafka SASL/PLAIN authentication

Kafka in practice: SASL/SCRAM

Create the SCRAM credentials

bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=alice-secret],SCRAM-SHA-512=[password=alice-secret]' --entity-type users --entity-name alice

bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[password=admin-secret],SCRAM-SHA-512=[password=admin-secret]' --entity-type users --entity-name admin

Verify the credentials

bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users --entity-name alice

bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users --entity-name admin
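What kafka-configs.sh stores is not the password itself: under SCRAM (RFC 5802, with SHA-256 from RFC 7677) the broker keeps a salt, the iteration count and two derived keys. The sketch below derives StoredKey and ServerKey for SCRAM-SHA-256 with the JDK's PBKDF2WithHmacSHA256. It illustrates the RFC formulas only (it skips SASLprep normalization, which leaves plain ASCII passwords unchanged) and is not Kafka's own code; ScramCredentialSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration of the SCRAM-SHA-256 key derivation (RFC 5802 / RFC 7677).
final class ScramCredentialSketch {

    // SaltedPassword = Hi(password, salt, i), i.e. PBKDF2 with HMAC-SHA-256 and a 256-bit output.
    static byte[] saltedPassword(String password, byte[] salt, int iterations) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password.toCharArray(), salt, iterations, 256);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // HMAC-SHA-256(key, message)
    static byte[] hmac(byte[] key, String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(message.getBytes(StandardCharsets.US_ASCII));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {StoredKey, ServerKey} with StoredKey = SHA-256(HMAC(SaltedPassword, "Client Key"))
    // and ServerKey = HMAC(SaltedPassword, "Server Key").
    static byte[][] storedAndServerKey(String password, byte[] salt, int iterations) {
        byte[] salted = saltedPassword(password, salt, iterations);
        byte[] clientKey = hmac(salted, "Client Key");
        try {
            byte[] storedKey = MessageDigest.getInstance("SHA-256").digest(clientKey);
            return new byte[][] {storedKey, hmac(salted, "Server Key")};
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt); // random salt, as a SCRAM server would generate
        byte[][] keys = storedAndServerKey("alice-secret", salt, 8192); // iterations=8192 as in the command above
        // Prints: stored_key: 32 bytes, server_key: 32 bytes
        System.out.println("stored_key: " + keys[0].length + " bytes, server_key: " + keys[1].length + " bytes");
    }
}
```

Because only these derived values are stored, a leaked credential entry does not directly reveal alice-secret, and a higher iteration count makes brute-forcing it slower.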

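Also make sure config/server.properties lists SCRAM-SHA-256 in sasl.enabled.mechanisms (for example sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256); otherwise the broker will reject the SCRAM-SHA-256 clients configured below.

more /etc/kafka/kafka_server_jaas.conf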

KafkaServer {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="admin"
    password="admin-secret";

    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="admin"
    password="admin-secret"
    user_admin="admin-secret"
    user_alice="alice-secret";
};

more /etc/kafka/kafka_client_jaas.conf

KafkaClient {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="alice"
    password="alice-secret";
};

consumer.properties and producer.properties

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256

ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=test1234

Start ZooKeeper

export KAFKA_OPTS=''
bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka

export KAFKA_OPTS='-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf'
bin/kafka-server-start.sh config/server.properties

Start the producer and consumer

export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"
bin/kafka-console-producer.sh --broker-list localhost:9093 --topic test --producer.config config/producer.properties 

export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf"
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --consumer.config config/consumer.properties

This example is based on the article

Kafka SASL/SCRAM authentication

References

Official site: https://kafka.apache.org/

Quickstart: https://kafka.apache.org/quickstart

w3c school: https://www.w3cschool.cn/apache_kafka/

OrcHome community: https://www.orchome.com/kafka/index

posted @ 2020-02-17 13:48 by 麦奇
