Kafka Idempotence and Transactions

Idempotence and Transactions

Idempotence

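By default Kafka delivers messages at least once; enabling idempotence and transactions makes exactly-once delivery possible. An operation is idempotent when calling it many times produces the same result as calling it once. When a send fails the producer retries, and the retry can write the same message twice; Kafka's idempotent producer prevents this. With idempotence enabled, the producer assigns a sequence number to each message that the application has placed in the send buffer with send(). Once assigned, the sequence number never changes, so when a failed message is resent automatically the broker can tell from its sequence number whether it is a duplicate. To enable idempotence, the producer client must set enable.idempotence to true, retries to a value greater than 0, acks to -1 (all), and max.in.flight.requests.per.connection to at most 5 (the default is 5).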

To take advantage of the idempotent producer, it is imperative to avoid application level re-sends since these cannot be de-duplicated. As such, if an application enables idempotence, it is recommended to leave the retries config unset, as it will be defaulted to Integer.MAX_VALUE. Additionally, if a send(ProducerRecord) returns an error even with infinite retries (for instance if the message expires in the buffer before being sent), then it is recommended to shut down the producer and check the contents of the last produced message to ensure that it is not duplicated.
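The settings listed above can be collected in one place. The following is a minimal sketch that uses plain string keys so that it compiles without the Kafka client on the classpath; the class name, the method name and the serializer choice are illustrative assumptions, while the configuration keys are the ones named in the text.

```java
import java.util.Properties;

// Illustrative helper (hypothetical name): builds producer settings for idempotence.
class IdempotentProducerConfigSketch {
    static Properties idempotentProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Serializers are an assumption for the example; any serializer works.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The settings discussed in the text:
        props.put("enable.idempotence", "true");
        props.put("acks", "all"); // same as -1
        props.put("max.in.flight.requests.per.connection", "5"); // must not exceed 5
        // "retries" is deliberately left unset, as the quoted Javadoc recommends.
        return props;
    }
}
```

The resulting Properties object can be passed to the KafkaProducer constructor.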

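To implement producer idempotence, Kafka introduces two concepts: the producer id (PID) and the sequence number. They correspond to the producer id and first sequence fields of RecordBatch in the v2 log format. Every producer instance is assigned a PID when it initializes. The PID is entirely transparent to the user and is globally unique. A producer that restarts after a failure is assigned a new PID, which is one reason idempotence cannot span sessions.

For each PID, every partition that the producer writes to has its own sequence number, starting at 0 and increasing monotonically: each message the producer sends increments the sequence number of its <PID, partition> by 1. The broker keeps a sequence number in memory for each <PID, partition>. It also caches information about the 5 most recent batches of each <PID, partition>; when a further batch is added the oldest one is evicted, which is why max.in.flight.requests.per.connection cannot be greater than 5.

For each message it receives, the broker first checks whether the message duplicates an entry in the cache; if so, it returns success immediately. Otherwise the broker accepts the message only if its sequence number (SN_new) is exactly one greater than the sequence number the broker holds (SN_old), that is, SN_new = SN_old + 1. If SN_new < SN_old + 1, the message has already been written and the broker discards it. If SN_new > SN_old + 1, some data in between has not been written yet: the messages are out of order, which suggests that messages may have been lost, and the producer throws OutOfOrderSequenceException. This exception is uncommon in practice, because the underlying transport is TCP, which normally neither loses nor reorders data.

Sequence numbers provide idempotence only per <PID, partition>. In other words, Kafka idempotence covers a single partition within a single producer session. If the application calls KafkaProducer to send two messages with identical content, Kafka treats them as two different messages and assigns them different sequence numbers, so Kafka does not guarantee idempotence of message content. Idempotence across sessions or across multiple topic-partitions requires Kafka transactions.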
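The broker-side check described above can be modelled in a few lines. This is a simplified sketch for a single <PID, partition>, not Kafka's actual implementation; the class, enum and method names are invented for illustration, and a batch is represented only by its first and last sequence number.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the per-<PID, partition> sequence check described in the text.
class SequenceCheckSketch {
    enum Result { APPENDED, DUPLICATE_IN_CACHE, DUPLICATE_DROPPED, OUT_OF_ORDER }

    private static final int CACHED_BATCHES = 5;
    // Each entry is {firstSeq, lastSeq} of a recently appended batch.
    private final Deque<int[]> recentBatches = new ArrayDeque<>();
    private int lastSequence = -1; // SN_old; -1 means nothing has been written yet

    Result tryAppend(int firstSeq, int lastSeq) {
        // 1. A batch that matches a cached one is acknowledged again without being written.
        for (int[] batch : recentBatches) {
            if (batch[0] == firstSeq && batch[1] == lastSeq) {
                return Result.DUPLICATE_IN_CACHE;
            }
        }
        // 2. SN_new < SN_old + 1: already written, drop it.
        if (firstSeq < lastSequence + 1) {
            return Result.DUPLICATE_DROPPED;
        }
        // 3. SN_new > SN_old + 1: there is a gap, so reject as out of order.
        if (firstSeq > lastSequence + 1) {
            return Result.OUT_OF_ORDER;
        }
        // 4. SN_new == SN_old + 1: accept and remember the batch, evicting the oldest entry.
        if (recentBatches.size() == CACHED_BATCHES) {
            recentBatches.removeFirst();
        }
        recentBatches.addLast(new int[] {firstSeq, lastSeq});
        lastSequence = lastSeq;
        return Result.APPENDED;
    }
}
```

Because only five batches are remembered, a retry of a sixth in-flight batch could no longer be recognised from the cache, which is the reason given above for the limit on max.in.flight.requests.per.connection.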

PID

http://matt33.com/2018/10/24/kafka-idempotent/
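Without transactions (no transactionalId), obtaining a PID is fairly simple. The client sends an InitProducerIdRequest to the least-loaded broker. The broker returns the nextProducerId it keeps in memory and then increments nextProducerId by 1. When a broker has used up its local PIDs, or has just been created, it requests a block of PIDs from ZooKeeper (1000 PIDs per request by default). The broker requests a PID block as follows: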

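1. Read the most recently allocated PID block and the zkVersion from the /latest_producer_id_block node in ZooKeeper.
- The data in /latest_producer_id_block has the following format:
```
 {"version":1,"broker":35,"block_start":"4000","block_end":"4999"}
```
2. If the node does not exist, start allocating from 0 and take the block 0~999 (PidBlockSize, the size of each requested block, defaults to 1000). If the node exists, read its data and take the block that starts right after block_end (if the block would exceed the maximum value of Long, an exception is returned).
3. Write the chosen block back to the node together with the zkVersion. If the write succeeds, the block has been allocated. On a write, ZooKeeper checks whether the node's current zkVersion equals the zkVersion read in step 1: if it does, the write succeeds; otherwise it fails, which shows that the node has been modified. A failed write therefore means that another broker may have updated the node (and may already have taken this block), so the broker starts again from step 1 until the write succeeds.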
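The read, choose and conditional-write loop above is an optimistic concurrency pattern. The sketch below imitates it in memory: an AtomicReference stands in for the ZooKeeper node, and compareAndSet plays the role of the zkVersion check. All names are hypothetical and no real ZooKeeper API is used.

```java
import java.util.concurrent.atomic.AtomicReference;

// In-memory imitation of PID block allocation; not the broker's actual code.
class PidBlockAllocatorSketch {
    static final long PID_BLOCK_SIZE = 1000L;

    // Snapshot of the node contents plus the version used for the conditional write.
    static final class BlockNode {
        final long blockStart;
        final long blockEnd;
        final int zkVersion;

        BlockNode(long blockStart, long blockEnd, int zkVersion) {
            this.blockStart = blockStart;
            this.blockEnd = blockEnd;
            this.zkVersion = zkVersion;
        }
    }

    // null means the node does not exist yet.
    private final AtomicReference<BlockNode> node = new AtomicReference<>();

    /** Returns {blockStart, blockEnd} of the newly claimed block. */
    long[] allocateBlock() {
        while (true) {
            BlockNode current = node.get(); // step 1: read data and version
            if (current != null && current.blockEnd > Long.MAX_VALUE - PID_BLOCK_SIZE) {
                throw new IllegalStateException("producer id space exhausted");
            }
            // step 2: start at 0 if the node is missing, otherwise right after block_end
            long start = (current == null) ? 0L : current.blockEnd + 1;
            long end = start + PID_BLOCK_SIZE - 1;
            int version = (current == null) ? 0 : current.zkVersion + 1;
            // step 3: conditional write; fails if someone else changed the node meanwhile
            if (node.compareAndSet(current, new BlockNode(start, end, version))) {
                return new long[] {start, end};
            }
            // write failed: another allocator won the race, retry from step 1
        }
    }
}
```

Two successive calls on the same instance return the blocks 0~999 and 1000~1999.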

producer epoch

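Without transactions (no transactionalId), the producer epoch in the InitProducerIdResponse that the broker returns to the client is always 0.
The messages that the producer subsequently sends to the broker also carry the producer epoch (always 0). On receiving a message the broker performs the following checks: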

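Check whether the PID already exists in the cache:
1. If it does not exist, check whether the sequence number starts from 0:
  1. If it does, record the PID's metadata (PID, epoch, sequence number) in the cache and perform the write.
  2. Otherwise return UnknownProducerIdException
    (the PID has expired on the server, or all data written by this PID has expired, but the client is still sending data that continues from its previous sequence number).
2. If the PID exists in the cache, first check whether the producer epoch matches the one recorded on the server:
  1. If it differs and the sequence number does not start from 0, return OutOfOrderSequenceException.
  2. If it differs and the sequence number starts from 0, write normally.
  3. If it matches, check against the most recent sequence number in the cache that the new one is consecutive; if it is not, return OutOfOrderSequenceException.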
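The decision list above translates almost directly into code. The following sketch is a simplified model with invented names; it returns an outcome value instead of throwing the exceptions named in the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the broker's PID / epoch / sequence validation for one partition.
class ProducerStateCheckSketch {
    enum Outcome { APPEND, UNKNOWN_PRODUCER_ID, OUT_OF_ORDER_SEQUENCE }

    private static final class Entry {
        short epoch;
        int lastSeq;
    }

    private final Map<Long, Entry> cache = new HashMap<>();

    Outcome validate(long pid, short epoch, int firstSeq, int lastSeq) {
        Entry entry = cache.get(pid);
        if (entry == null) {
            // Unknown PID: only a batch starting at sequence 0 may create the entry.
            if (firstSeq != 0) {
                return Outcome.UNKNOWN_PRODUCER_ID;
            }
            entry = new Entry();
            cache.put(pid, entry);
        } else if (entry.epoch != epoch) {
            // Different epoch: the sequence must restart from 0.
            if (firstSeq != 0) {
                return Outcome.OUT_OF_ORDER_SEQUENCE;
            }
        } else if (firstSeq != entry.lastSeq + 1) {
            // Same epoch: the sequence must continue without a gap.
            return Outcome.OUT_OF_ORDER_SEQUENCE;
        }
        // Accepted: remember the latest epoch and sequence number.
        entry.epoch = epoch;
        entry.lastSeq = lastSeq;
        return Outcome.APPEND;
    }
}
```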

Ordering under Idempotence

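When MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION is greater than 1 and idempotence is not enabled, ordering may not be preserved. Suppose the client sends 5 requests, 1, 2, 3, 4 and 5, and requests 2-5 are acknowledged while request 1 fails (the author personally doubts that this case occurs). Request 1 is then retried, and the data is now out of order, because the data of request 1 lands after that of requests 2-5.
With idempotence enabled, setting MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION greater than 1 currently causes no ordering problem. When the server handles requests it requires their sequence numbers to be consecutive and returns an error for any request that breaks the sequence, so the client retries. For example, if the client sends requests 1, 2, 3, 4, 5 in order and request 2 fails, then requests 3, 4 and 5 also fail (their sequence numbers are no longer consecutive), and 2, 3, 4 and 5 are all retried. When retrying, the client resends the requests in sequence-number order.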

PID snapshot

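For each topic-partition the broker keeps the mapping from PID to sequence number in memory. To restore this state after a restart the broker would have to read all the log files. By comparison, periodically checkpointing the state (taking a snapshot) pays off handsomely: after a restart the broker only needs to read the most recent snapshot file to restore the state. A PID snapshot looks like this: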

// For the structure, see ProducerStateEntry
[matt@XXX-35 app.matt_test_transaction_json_3-2]$ /usr/local/java18/bin/java -Djava.ext.dirs=/XXX/kafka/libs kafka.tools.DumpLogSegments --files 00000000000235947656.snapshot
producerId: 2000 producerEpoch: 1 coordinatorEpoch: 4 currentTxnFirstOffset: None firstSequence: 95769510 lastSequence: 95769511 lastOffset: 235947654 offsetDelta: 1 timestamp: 1541325156503
producerId: 3000 producerEpoch: 5 coordinatorEpoch: 6 currentTxnFirstOffset: None firstSequence: 91669662 lastSequence: 91669666 lastOffset: 235947651 offsetDelta: 4 timestamp: 1541325156454

Transactions

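Idempotence cannot be used across multiple partitions or across sessions; transactions make up for these shortcomings. A transaction makes writes to multiple partitions atomic, meaning that the operations either all succeed or all fail, and partial success is not possible. Kafka transactions guarantee: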

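Idempotent writes across sessions: even if the producer fails, it keeps its idempotence after recovery (and continues to provide the exactly-once guarantee).
Transaction recovery across sessions: if an application instance dies, the next instance that starts can still ensure that the previous transaction is completed (committed or aborted).
Atomic writes across multiple topic-partitions: data that spans partitions is either all written or not written at all, with no intermediate state.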

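To use transactions, each producer must be explicitly configured with a unique transactionalId, which stays the same even after failure recovery. A transactionalId maps one-to-one to a PID. The difference between the two is that the transactionalId is set explicitly by the user while the PID is assigned by the Kafka server. For a transactional producer the PID also stays the same after failure recovery.

In addition, so that an old producer with the same transactionalId is invalidated as soon as a new one starts, each producer obtains a monotonically increasing producer epoch together with the PID when it registers its transactionalId. If two producers are started with the same transactionalId, the older one (with the lower producer_epoch) throws an exception and stops working. This mechanism allows transactions to be handed over and recovered across producer instances. When a producer dies, a new instance with the same transactionalId is guaranteed that any unfinished old transaction is either completed (if a commit or abort had already been initiated and the transaction is in the PrepareCommit or PrepareAbort state, the server returns an error so that the client waits and retries) or aborted (if completion had not been initiated and the transaction is still Ongoing, the server aborts it automatically and returns an error so that the client waits and retries). The new producer instance can therefore start working from a clean state.

transaction.timeout.ms (default 60 s) controls the transaction timeout; the TransactionCoordinator aborts a transaction once it has timed out.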
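The fencing effect of the producer epoch can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions (no transaction state, no epoch overflow); the class and method names are invented and do not correspond to Kafka classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: one <PID, epoch> pair per transactionalId, epoch bumped on every init.
class EpochFencingSketch {
    static final class ProducerIdentity {
        final long producerId;
        final short epoch;

        ProducerIdentity(long producerId, short epoch) {
            this.producerId = producerId;
            this.epoch = epoch;
        }
    }

    private final Map<String, ProducerIdentity> byTransactionalId = new HashMap<>();
    private long nextProducerId = 0L;

    /** Called when a producer instance starts; keeps the PID and bumps the epoch. */
    ProducerIdentity initProducerId(String transactionalId) {
        ProducerIdentity old = byTransactionalId.get(transactionalId);
        ProducerIdentity fresh = (old == null)
                ? new ProducerIdentity(nextProducerId++, (short) 0)
                : new ProducerIdentity(old.producerId, (short) (old.epoch + 1));
        byTransactionalId.put(transactionalId, fresh);
        return fresh;
    }

    /** A caller is fenced if a newer instance has registered the same transactionalId. */
    boolean isFenced(String transactionalId, ProducerIdentity caller) {
        ProducerIdentity current = byTransactionalId.get(transactionalId);
        return current == null
                || caller.producerId != current.producerId
                || caller.epoch < current.epoch;
    }
}
```

If two instances call initProducerId with the same transactionalId, both hold the same PID, and only the one that called last passes the isFenced check.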

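From the consumer's point of view, the semantics that transactions guarantee are weaker. For the following reasons, Kafka cannot guarantee that every message of a committed transaction will be consumed: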

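Messages in the log may be deleted (because they expire or because log compaction is in use).
A consumer can use seek() to access messages at any offset and may therefore miss some of the messages of a transaction.
A consumer that commits offsets before consuming may miss messages.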

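The consumer parameter isolation.level defaults to read_uncommitted, which means the consuming application can see (consume) messages of uncommitted transactions, and of course those of committed transactions as well. The parameter can also be set to read_committed, in which case the application cannot see messages in transactions that have not been committed yet. For example, if a producer opens a transaction and sends three messages msg1, msg2 and msg3 to a partition, a consumer set to read_committed cannot consume these messages before commitTransaction() or abortTransaction() has been called.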

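For a consumer whose isolation.level is read_committed, the broker answers a Fetch request only with data before the LSO (last stable offset); data after the LSO is not returned (see handleFetchRequest). The broker tracks the aborted transactions of each partition. Every log segment of a partition has an append-only file that stores aborted-transaction information. Because aborted transactions are few, this overhead is acceptable. The information is persisted to disk mainly for fast recovery after a failure; otherwise the broker would have to read all the data of the partition to find out which transactions were aborted, which would be too expensive. When the broker returns data for a consumer's Fetch request, it also returns the set of aborted transactions that the data involves. The consumer still consumes strictly in offset order; it simply discards data that belongs to aborted transactions and otherwise behaves like an ordinary consumer. Transaction markers (commit or abort) also occupy offsets, and messages of aborted transactions are filtered out, so the offsets of the messages that the consuming application sees are not contiguous.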
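The client-side filtering described above can be sketched as a single pass over the fetched records. The model below is a simplification with invented types: each record carries its offset, the PID that wrote it and a kind (data, commit marker or abort marker), and the aborted-transaction list returned with the fetch is given as {producerId, firstOffset} pairs sorted by firstOffset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of read_committed filtering on the consumer side.
class ReadCommittedFilterSketch {
    enum Kind { DATA, COMMIT_MARKER, ABORT_MARKER }

    static final class Rec {
        final long offset;
        final long producerId;
        final Kind kind;
        final String value;

        Rec(long offset, long producerId, Kind kind, String value) {
            this.offset = offset;
            this.producerId = producerId;
            this.kind = kind;
            this.value = value;
        }
    }

    static List<String> visibleValues(List<Rec> recordsInOffsetOrder, List<long[]> abortedTxns) {
        Set<Long> abortedProducers = new HashSet<>();
        List<String> visible = new ArrayList<>();
        int next = 0;
        for (Rec rec : recordsInOffsetOrder) {
            // Activate every aborted transaction that has started at or before this offset.
            while (next < abortedTxns.size() && abortedTxns.get(next)[1] <= rec.offset) {
                abortedProducers.add(abortedTxns.get(next)[0]);
                next++;
            }
            if (rec.kind == Kind.ABORT_MARKER) {
                abortedProducers.remove(rec.producerId); // the aborted transaction ends here
            } else if (rec.kind == Kind.DATA && !abortedProducers.contains(rec.producerId)) {
                visible.add(rec.value);
            }
            // Markers themselves are never handed to the application.
        }
        return visible;
    }

    // Producer 1 commits "a" and "b"; producer 2 aborts "x" and later commits "y".
    static List<String> demo() {
        List<Rec> log = Arrays.asList(
                new Rec(0, 1L, Kind.DATA, "a"),
                new Rec(1, 2L, Kind.DATA, "x"),
                new Rec(2, 1L, Kind.DATA, "b"),
                new Rec(3, 2L, Kind.ABORT_MARKER, null),
                new Rec(4, 1L, Kind.COMMIT_MARKER, null),
                new Rec(5, 2L, Kind.DATA, "y"),
                new Rec(6, 2L, Kind.COMMIT_MARKER, null));
        return visibleValues(log, Collections.singletonList(new long[] {2L, 1L}));
    }
}
```

In the demo the application sees the values at offsets 0, 2 and 5 only, which also shows why the visible offsets are not contiguous.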

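Besides ordinary messages, the log contains a special kind of message that marks the end of a transaction: the control message (ControlBatch). There are two types, COMMIT and ABORT, which indicate that the transaction was successfully committed or successfully aborted. KafkaConsumer uses these control messages to determine whether the corresponding transaction was committed or aborted, and then decides, according to the isolation level configured in isolation.level, whether to return the corresponding messages to the consuming application. Note that a ControlBatch is never visible to the consuming application.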

KafkaProducer has 5 transaction-related methods; KafkaConsumer has none.

/** Needs to be called before any other methods when the transactional.id is set in the configuration.
 * This method does the following:
 * 1. Ensures any transactions initiated by previous instances of the producer with the same
 *    transactional.id are completed. If the previous instance had failed with a transaction in
 *    progress, it will be aborted. If the last transaction had begun completion,
 *    but not yet finished, this method awaits its completion.
 * 2. Gets the internal producer id and epoch, used in all future transactional messages issued by the producer.
 */
public void initTransactions();

/** Should be called before the start of each new transaction. Note that prior to the first invocation
 * of this method, you must invoke {@link #initTransactions()} exactly one time. */
public void beginTransaction() throws ProducerFencedException;

/** Sends a list of specified offsets to the consumer group coordinator, and also marks
 * those offsets as part of the current transaction. These offsets will be considered
 * committed only if the transaction is committed successfully. The committed offset should
 * be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
 *
 * This method should be used when you need to batch consumed and produced messages
 * together, typically in a consume-transform-produce pattern. Thus, the specified
 * {@code consumerGroupId} should be the same as config parameter {@code group.id} of the used
 * {@link KafkaConsumer consumer}. Note, that the consumer should have {@code enable.auto.commit=false}
 * and should also not commit offsets manually (via {@link KafkaConsumer#commitSync(Map) sync} or
 * {@link KafkaConsumer#commitAsync(Map, OffsetCommitCallback) async} commits). */
public void sendOffsetsToTransaction(Map<TopicPartition, OffsetAndMetadata> offsets, String consumerGroupId);

/** Commits the ongoing transaction. This method will flush any unsent records before actually committing the transaction.
 * Further, if any of the {@link #send(ProducerRecord)} calls which were part of the transaction hit irrecoverable
 * errors, this method will throw the last received exception immediately and the transaction will not be committed.
 * So all {@link #send(ProducerRecord)} calls in a transaction must succeed in order for this method to succeed. */
public void commitTransaction() throws ProducerFencedException;

/** Aborts the ongoing transaction. Any unflushed produce messages will be aborted when this call is made.
 * This call will throw an exception immediately if any prior {@link #send(ProducerRecord)} calls failed with a
 * {@link ProducerFencedException} or an instance of {@link org.apache.kafka.common.errors.AuthorizationException} */
public void abortTransaction() throws ProducerFencedException;

The Consume-Transform-Produce Pattern in Stream Processing

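In this pattern consumption and production coexist: the application consumes messages from one topic, transforms them, and writes the results to another topic. A problem while committing consumer offsets can cause messages to be consumed again, and the producer can also produce duplicate messages. Kafka transactions let the application treat consuming messages, producing messages and committing consumer offsets as one atomic operation that succeeds or fails as a whole, even when production or consumption spans multiple partitions. KafkaProducer's sendOffsetsToTransaction() method exists mainly for this scenario.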

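import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionConsumeTransformProduceExample {
    public static final String brokerList = "10.198.197.73:9092";

    public static Properties getConsumerProperties() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Offsets are committed by the producer inside the transaction, never by the consumer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupId");
        return props;
    }

    public static Properties getProducerProperties() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transactionalId");
        return props;
    }

    public static void main(String[] args) {
        // Initialize the consumer and the producer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(getConsumerProperties());
        consumer.subscribe(Collections.singletonList("topic-source"));
        KafkaProducer<String, String> producer = new KafkaProducer<>(getProducerProperties());
        // Initialize transactions
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            if (!records.isEmpty()) {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                // Begin the transaction
                producer.beginTransaction();
                try {
                    for (TopicPartition partition : records.partitions()) {
                        List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
                        for (ConsumerRecord<String, String> record : partitionRecords) {
                            // do some logical processing.
                            ProducerRecord<String, String> producerRecord =
                                    new ProducerRecord<>("topic-sink", record.key(), record.value());
                            // consume-produce step
                            producer.send(producerRecord);
                        } // end for
                        long lastConsumedOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
                        // The committed offset is the next offset to consume
                        offsets.put(partition, new OffsetAndMetadata(lastConsumedOffset + 1));
                    } // end for
                    // Commit the consumer offsets.
                    // The producer commits them using the consumer's group id, so that consuming,
                    // producing and committing offsets are handled as one atomic operation.
                    // This requires enable.auto.commit=false on the consumer, and the consumer must
                    // not commit offsets manually either: all offset commits go through the producer.
                    // (Since Kafka 2.5 this overload is deprecated in favour of the one that takes
                    // ConsumerGroupMetadata.)
                    producer.sendOffsetsToTransaction(offsets, "groupId");
                    // Commit the transaction
                    producer.commitTransaction();
                } catch (ProducerFencedException e) {
                    // log the exception
                    // Another producer with the same transactional.id has taken over. This instance
                    // cannot abort or continue, so close it and stop.
                    producer.close();
                    consumer.close();
                    return;
                } catch (KafkaException e) {
                    // log the exception
                    // Abort the transaction. The consumer's position is already past these records;
                    // rewind to the last committed offsets if they must be processed again.
                    producer.abortTransaction();
                }
            } // end if
        } // end while
    }
}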

The Transaction Coordinator (TransactionCoordinator)

http://matt33.com/2018/11/04/kafka-transaction/
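Every transactional producer is assigned a TransactionCoordinator, which carries out all transaction logic, including PID assignment. The TransactionCoordinator persists transaction state to the internal topic __transaction_state. If the TransactionCoordinator of a transaction goes down, its role moves to another machine, and the new TransactionCoordinator restores the transaction state from this topic during failover. The rest of this section uses the most complex flow, consume-transform-produce, to analyse how Kafka transactions are implemented.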

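1. Finding the TransactionCoordinator

The producer sends a request containing its transactionalId to the least-loaded broker. The broker takes the absolute value of the hash of the transactionalId modulo the number of partitions of __transaction_state, and finds the broker that hosts the leader of the resulting partition (much like the lookup of the GroupCoordinator). That broker is the producer's TransactionCoordinator, and its details are returned to the producer.

If the producer uses sendOffsetsToTransaction(), it later also needs to find the GroupCoordinator from the groupId.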
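The partition lookup described above is a hash modulo the partition count. The sketch below shows the arithmetic only; the class and method names are invented, the handling of Integer.MIN_VALUE is an assumption about how a non-negative hash is obtained, and 50 is the default of the broker setting transaction.state.log.num.partitions.

```java
// Hypothetical helper showing how a transactionalId maps to a __transaction_state partition.
class CoordinatorPartitionSketch {
    static final int DEFAULT_TXN_TOPIC_PARTITIONS = 50;

    static int partitionFor(String transactionalId, int numPartitions) {
        int hash = transactionalId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so that value is mapped to 0 here.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % numPartitions;
    }
}
```

For example, "a".hashCode() is 97, so partitionFor("a", 50) is 47; the broker hosting the leader of that partition would act as the coordinator. The same calculation applied to a groupId and the partition count of __consumer_offsets locates the GroupCoordinator.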

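2. Obtaining the PID and Producer Epoch

The producer sends an InitProducerIdRequest containing its transactionalId to the TransactionCoordinator to obtain a PID and a producer epoch. (If the producer uses only idempotence and not transactions, the InitProducerIdRequest can be sent to any broker and carries no valid transactionalId.)

When the TransactionCoordinator receives an InitProducerIdRequest that carries a transactionalId, it looks in its cache for the TransactionMetadata of that transactionalId (transactional.id.expiration.ms controls when the cache entry expires). If none exists, it generates a new producer_id and producer_epoch (initial value 0) for the producer. If one exists, it inspects the transaction state in the TransactionMetadata:
- If the state is PrepareAbort or PrepareCommit, it returns a CONCURRENT_TRANSACTIONS error so that the client waits for the transaction to complete and then retries.
- If the state is Ongoing, it starts aborting the current transaction and returns a CONCURRENT_TRANSACTIONS error so that the client waits and retries.
- If the state is CompleteAbort, CompleteCommit or Empty, processing continues and the cached producer_epoch is incremented. If the producer_epoch would exceed the maximum value of short, a new producer_id is generated and the producer_epoch restarts from 0; otherwise the producer_id does not change. Once the producer_epoch has been increased, transactions from any other producer with the same PID but a lower producer_epoch are rejected.

The TransactionCoordinator then saves <transactionalId, TransactionMetadata> to the topic __transaction_state. This persists the transaction state of the transactionalId, so it is not lost even if the TransactionCoordinator goes down. Finally the producer_id and producer_epoch are returned to the client. The figure shows the format of the data stored in __transaction_state.

In it, transaction_status takes one of the values Empty(0), Ongoing(1), PrepareCommit(2), PrepareAbort(3), CompleteCommit(4), CompleteAbort(5) and Dead(6). For transaction log messages sent to __transaction_state, the partition is likewise computed from the transactionalId; the leader replica of that partition is on the broker that hosts the TransactionCoordinator.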
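The state-dependent handling of InitProducerIdRequest and the epoch overflow rule can be summarised in code. This is a sketch of the decisions as described in the text, with invented names; the Dead state is not covered by the description, and the overflow rule is simplified to "the epoch has reached the maximum value of short".

```java
// Hypothetical summary of how the coordinator reacts to InitProducerIdRequest.
class InitProducerIdDecisionSketch {
    enum TxnState {
        EMPTY(0), ONGOING(1), PREPARE_COMMIT(2), PREPARE_ABORT(3),
        COMPLETE_COMMIT(4), COMPLETE_ABORT(5), DEAD(6);

        final int id;

        TxnState(int id) {
            this.id = id;
        }
    }

    enum Action { RETRY_AFTER_COMPLETION, ABORT_THEN_RETRY, BUMP_EPOCH, NOT_DESCRIBED }

    static Action onInitProducerId(TxnState state) {
        switch (state) {
            case PREPARE_COMMIT:
            case PREPARE_ABORT:
                return Action.RETRY_AFTER_COMPLETION; // CONCURRENT_TRANSACTIONS, client retries
            case ONGOING:
                return Action.ABORT_THEN_RETRY;       // abort first, then CONCURRENT_TRANSACTIONS
            case EMPTY:
            case COMPLETE_COMMIT:
            case COMPLETE_ABORT:
                return Action.BUMP_EPOCH;
            default:
                return Action.NOT_DESCRIBED;          // DEAD is not covered by the text
        }
    }

    /** Returns {producerId, epoch} after the bump; freshProducerId is used only on overflow. */
    static long[] nextIdentity(long producerId, short epoch, long freshProducerId) {
        if (epoch == Short.MAX_VALUE) {
            return new long[] {freshProducerId, 0L};
        }
        return new long[] {producerId, epoch + 1L};
    }
}
```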

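3. The Producer Begins a Transaction

Calling beginTransaction() starts the transaction locally on the producer. This step does not interact with the server; the TransactionCoordinator considers the transaction started only after the producer has sent its first message.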

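4. The Consume-Transform-Produce Phase

This phase covers all the data processing of the transaction and involves several kinds of requests.

4.1. AddPartitionsToTxnRequest
When the producer calls send() to write to a partition (TopicPartition) that is new to the transaction, it sends an AddPartitionsToTxnRequest to the TransactionCoordinator (the request is placed on the sender thread's queue and sent asynchronously). The TransactionCoordinator adds the TopicPartition to the TransactionMetadata of the transactionalId and stores the new TransactionMetadata in the topic __transaction_state. Writing the COMMIT or ABORT markers to the partitions later relies on the partition information in the TransactionMetadata.
4.2. The producer sends messages to the broker
Unlike ordinary messages, the messages that the producer sends carry the PID, producer_epoch and sequence number.
4.3. AddOffsetsToTxnRequest
KafkaProducer contains a TransactionManager that is responsible for transactions. When KafkaProducer's sendOffsetsToTransaction() method is called (it takes two parameters: Map<TopicPartition, OffsetAndMetadata> offsets and groupId), the TransactionManager first sends an AddOffsetsToTxnRequest containing the groupId to the TransactionCoordinator (the request is placed on the sender thread's queue and sent asynchronously). The TransactionCoordinator uses the same code as the GroupCoordinator to compute which partition of __consumer_offsets holds the offsets committed by that groupId, and then records this __consumer_offsets partition in the TransactionMetadata and in __transaction_state.
4.4. TxnOffsetCommitRequest
This request is also part of KafkaProducer's sendOffsetsToTransaction() method. After receiving the response to the AddOffsetsToTxnRequest, the TransactionManager finds the GroupCoordinator of the groupId and sends it a TxnOffsetCommitRequest to commit the consumer offsets. On receiving the request, the GroupCoordinator persists the offsets to __consumer_offsets (together with the PID) and puts them in the pendingTransactionalOffsetCommits map of the consumer group's cache. Only when the transaction commits are the offsets moved from the pendingTransactionalOffsetCommits map to the offsets map of the group's cache. Before the transaction commits, a consumer therefore cannot obtain these offsets through an OffsetFetchRequest, because OffsetFetchRequest reads only from the offsets map.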
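The two-map arrangement in step 4.4 explains why transactional offsets stay invisible until the commit. The following in-memory sketch models only that behaviour; the class and method names are invented, partitions are represented as plain strings, and nothing is persisted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a group's offset cache with pending transactional commits.
class GroupOffsetsSketch {
    // Offsets visible to OffsetFetchRequest, keyed by topic-partition name.
    private final Map<String, Long> offsets = new HashMap<>();
    // Offsets written inside a still-open transaction, keyed by producer id.
    private final Map<Long, Map<String, Long>> pendingTransactionalOffsetCommits = new HashMap<>();

    /** Models TxnOffsetCommitRequest: the offset is recorded but not yet visible. */
    void txnOffsetCommit(long producerId, String topicPartition, long offset) {
        pendingTransactionalOffsetCommits
                .computeIfAbsent(producerId, id -> new HashMap<>())
                .put(topicPartition, offset);
    }

    /** Models the arrival of the transaction marker on the __consumer_offsets partition. */
    void completeTransaction(long producerId, boolean committed) {
        Map<String, Long> pending = pendingTransactionalOffsetCommits.remove(producerId);
        if (committed && pending != null) {
            offsets.putAll(pending); // a commit makes the offsets visible
        }
        // On abort the pending offsets are simply dropped.
    }

    /** Models OffsetFetchRequest: only the visible map is consulted. */
    Long fetchOffset(String topicPartition) {
        return offsets.get(topicPartition);
    }
}
```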

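5. Committing or Aborting the Transaction

After KafkaProducer's commitTransaction() (This method will flush any unsent records before actually committing the transaction) or abortTransaction() (Any unflushed produce messages will be aborted when this call is made) is called, the producer sends an EndTxnRequest to the TransactionCoordinator to tell it whether to commit or abort the transaction. The request is added to the send queue and sent asynchronously. EndTxnRequest is the last request of a transaction: once the TransactionCoordinator has received it, it has received all messages concerning this transaction.

One situation can arise with abortTransaction(): the producer has sent messages to a partition but calls abortTransaction() before it has reported that partition to the TransactionCoordinator through an AddPartitionsToTxnRequest. The TransactionCoordinator is then unaware of the partition and does not send a WriteTxnMarkersRequest to its leader, so the partition's log ends up without a ControlBatch for this transaction (see maybeAddPartitionToTransaction and beginAbort in TransactionManager).

After receiving the request, the TransactionCoordinator does the following:

5.1. It creates a new TransactionMetadata whose transaction state is PREPARE_COMMIT or PREPARE_ABORT, and first writes it to the topic __transaction_state (this is a two-phase commit, with the server-side TransactionCoordinator acting as the coordinator). If the write to __transaction_state succeeds and all ACKs have been received (required acks = -1), it replaces the old TransactionMetadata of the transactionalId with the new one and returns a response to the client (KafkaProducer's commitTransaction() and abortTransaction() wait synchronously for the TransactionCoordinator's response).
5.2. The TransactionCoordinator sends a WriteTxnMarkersRequest (through a send queue) to the leaders of all partitions of ordinary topics that the producer used and of the __consumer_offsets partition. On receiving the request, a leader writes a ControlBatch control message (carrying the COMMIT or ABORT information) to the log of the corresponding partition and uses a delayed operation to wait until the followers have replicated it (required acks = -1, with a timeout). A leader of __consumer_offsets additionally calls the GroupCoordinator to move the offsets from the pendingTransactionalOffsetCommits map to the offsets map of the group's cache (only then can a consumer obtain them through an OffsetFetchRequest). Finally the leader returns a response to the TransactionCoordinator.
- Once correct responses have been received from all brokers (a delayed operation with no timeout; see TransactionMarkerRequestCompletionHandler), the TransactionCoordinator proceeds to 5.3.
- If a request fails, it is retried when it can be; otherwise an exception is thrown or the request is abandoned.
5.3. It creates a new TransactionMetadata whose transaction state is COMPLETE_COMMIT or COMPLETE_ABORT and first writes it to the topic __transaction_state. If the write succeeds, it replaces the old TransactionMetadata of the transactionalId with the new one.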
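Steps 5.1 to 5.3 form a small state machine around two log writes and one round of markers. The sketch below records the order of events in a list instead of performing any I/O; it is an illustration with invented names and leaves out retries, timeouts and failures.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical trace of the coordinator's two-phase handling of EndTxnRequest.
class EndTxnFlowSketch {
    enum State { ONGOING, PREPARE_COMMIT, PREPARE_ABORT, COMPLETE_COMMIT, COMPLETE_ABORT }

    // Partitions registered through AddPartitionsToTxnRequest / AddOffsetsToTxnRequest.
    final Set<String> partitions = new LinkedHashSet<>();
    // Stand-in for writes to __transaction_state and for the markers sent to partition leaders.
    final List<String> events = new ArrayList<>();
    State state = State.ONGOING;

    void endTxn(boolean commit) {
        if (state != State.ONGOING) {
            throw new IllegalStateException("no ongoing transaction, state is " + state);
        }
        // 5.1: persist the PREPARE_* state first; the client is answered after this write.
        state = commit ? State.PREPARE_COMMIT : State.PREPARE_ABORT;
        events.add("txn-log:" + state);
        // 5.2: write a COMMIT or ABORT marker to every partition known to the coordinator.
        for (String partition : partitions) {
            events.add("marker:" + (commit ? "COMMIT" : "ABORT") + "@" + partition);
        }
        // 5.3: persist the COMPLETE_* state once all markers have been written.
        state = commit ? State.COMPLETE_COMMIT : State.COMPLETE_ABORT;
        events.add("txn-log:" + state);
    }
}
```

Because the PREPARE_* record is written before any marker, a coordinator that fails between 5.1 and 5.3 leaves enough information in __transaction_state for its successor to finish the commit or abort.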

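Recovering from Failures in the Middle of the Flow

In a production environment, any step of the transaction flow described above can fail:

A timeout or error when the producer calls beginTransaction(): the producer only needs to retry.
An error while the producer is sending data: the producer should abort the transaction. If the producer does not abort (for example because it is configured with unlimited retries and a very large batch timeout), the TransactionCoordinator aborts the transaction once it times out.
A timeout or error when the producer calls commitTransaction(): the producer should retry the request.
TransactionCoordinator failure: if the TransactionCoordinator changes (the leader of the __transaction_state partition changes), the new TransactionCoordinator can restore the transaction state from __transaction_state. If it finds a transaction in the PREPARE_COMMIT or PREPARE_ABORT state, it continues the commit or abort. If it finds an ongoing transaction, it does not need to abort it; the producer simply sends its requests to the new TransactionCoordinator to continue the transaction.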

TransactionCoordinator epoch

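The TransactionCoordinator runs on the leader of a partition of the __transaction_state topic. To avoid ending up with two TransactionCoordinators during a leader change (two leaders for the same partition), every TransactionCoordinator has a CoordinatorEpoch, which is the epoch of the corresponding __transaction_state partition (it is incremented by 1 every time the leader changes). When another broker receives a request from a TransactionCoordinator and finds that its CoordinatorEpoch is lower than the latest value in its own cache, it rejects the request.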

posted @ 2022-12-27 21:28 zoo-keeper
