Troubleshooting Series (48): Message Loss on Some Kafka Topics

Background

A colleague reported message loss on one Kafka topic (the number of messages consumed was lower than the number produced), with a loss rate of roughly 20%.

Investigation

The affected topic has a replication factor of 2 and 5 partitions. Its state at the time was as follows:

The topic was in an Under Replicated state: one of its partitions had an ISR containing only a single broker. Broker 193 had been removed from the ISR and never rejoined, nor was it replaced by another broker. (One partition out of five is consistent with the roughly 20% loss rate.)
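
The state can be checked with kafka-topics.sh --describe, or programmatically. Below is a minimal sketch using the AdminClient API; the bootstrap address and topic name are placeholders, not the actual environment:

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object UnderReplicatedCheck extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-192:9092") // placeholder address
  val admin = AdminClient.create(props)
  try {
    // describe the topic and print partitions whose ISR is smaller than the assigned replica set
    val desc = admin.describeTopics(Collections.singletonList("test")).all().get().get("test")
    desc.partitions().asScala
      .filter(p => p.isr().size() < p.replicas().size())
      .foreach { p =>
        val replicas = p.replicas().asScala.map(_.id()).mkString(",")
        val isr = p.isr().asScala.map(_.id()).mkString(",")
        println(s"partition=${p.partition()} replicas=$replicas isr=$isr")
      }
  } finally admin.close()
}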

The first step was to check whether the producer reported any errors. In Kafka 2.0 the producer sends asynchronously and supports interceptors; a custom interceptor can be registered via the interceptor.classes config.

All send errors are propagated back to the interceptor.

After adding an interceptor and logging the exception in its error callback (onSendError), the following error showed up (a sketch of such an interceptor follows the log):

ERROR ninebot.iot.hub.KafkaTest 100 - send error :Messages are rejected since there are fewer in-sync replicas than required.
org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
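
A minimal sketch of such a logging interceptor, assuming a plain SLF4J logger (class and logger names are illustrative, not the project's actual code). For a failed send, the producer passes the exception to the interceptor's onAcknowledgement, either from the produce-response callback or via the internal ProducerInterceptors.onSendError:

import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.{ProducerInterceptor, ProducerRecord, RecordMetadata}
import org.slf4j.LoggerFactory

class ErrorLoggingInterceptor extends ProducerInterceptor[String, String] {
  private val log = LoggerFactory.getLogger(classOf[ErrorLoggingInterceptor])

  // pass records through untouched
  override def onSend(record: ProducerRecord[String, String]): ProducerRecord[String, String] = record

  // called for every send result; exception is non-null when the send failed
  override def onAcknowledgement(metadata: RecordMetadata, exception: Exception): Unit =
    if (exception != null) log.error(s"send error: ${exception.getMessage}", exception)

  override def close(): Unit = ()

  override def configure(configs: JMap[String, _]): Unit = ()
}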

The immediate cause of the error: partition-0 had only 1 in-sync replica, fewer than the configured minimum of 2 (min.insync.replicas=2), so every message sent to this partition was rejected.
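
For context, the broker rejects appends with NotEnoughReplicasException only when the producer uses acks=all (acks=-1) and the partition's in-sync replica count is below min.insync.replicas. A sketch of the assumed producer setup (the bootstrap address and topic name are placeholders, and acks=all is an assumption inferred from the error):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object ProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-192:9092") // placeholder
  props.put(ProducerConfig.ACKS_CONFIG, "all")                          // assumed; required for NotEnoughReplicasException
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  // fully qualified name of the interceptor sketched above (no package in that sketch)
  props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, "ErrorLoggingInterceptor")

  val producer = new KafkaProducer[String, String](props)
  // With min.insync.replicas=2 on the topic and only one replica of partition-0 in sync,
  // sends routed to partition-0 are rejected and the error surfaces through the interceptor.
  producer.send(new ProducerRecord[String, String]("test", "key", "value"))
  producer.close()
}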

On broker 192, the following log entries were found:

2021-07-27 17:41:47,159 INFO kafka.cluster.Partition: [Partition test-7 broker=192] Shrinking ISR from 192,193 to 192
2021-07-27 17:51:07,551 ERROR kafka.server.ReplicaManager: [ReplicaManager broker=192] Error processing append operation on partition test-7
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition test-7 is [1], below required minimum [2]

Searching further for "Shrinking ISR from" showed that a large number of partitions dropped broker 193 from their ISR at the same time.
Looking at the Kafka source, the ISR is shrunk when a follower's lastCaughtUpTimeMs has fallen behind by more than replica.lag.time.max.ms; a sketch of the check follows.
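
The check looks roughly like this (paraphrased from Partition in the 2.0 branch, not a verbatim quote; maxLagMs comes from replica.lag.time.max.ms):

// A follower is considered out of sync, and removed from the ISR by maybeShrinkIsr,
// once it has not caught up with the leader within replica.lag.time.max.ms.
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
  val candidateReplicas = inSyncReplicas - leaderReplica
  candidateReplicas.filter(r => (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
}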

The logs on broker 193 showed an error at that time:

2021-07-27 17:41:11,547 INFO kafka.server.epoch.LeaderEpochFileCache: Updated PartitionLeaderEpoch. New: {epoch:15, offset:201}, Current: {epoch:-1, offset:-1} for Partition: test-1. Cache now contains 0 entries.
2021-07-27 17:41:11,549 WARN kafka.log.Log: [Log partition=test-1, dir=/data/kafka] Newly rolled segment file /data/kafka/test-1/00000000000000000201.log already exists; deleting it first
2021-07-27 17:41:11,549 WARN kafka.log.Log: [Log partition=test-1, dir=/data/kafka] Newly rolled segment file /data/kafka/test-1/00000000000000000201.index already exists; deleting it first
2021-07-27 17:41:11,549 WARN kafka.log.Log: [Log partition=test-1, dir=/data/kafka] Newly rolled segment file /data/kafka/test-1/00000000000000000201.timeindex already exists; deleting it first
2021-07-27 17:41:11,550 ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcher replicaId=193, leaderId=192, fetcherId=2] Error due to
org.apache.kafka.common.KafkaException: Error processing data for partition test-1 offset 201
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:207)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: org.apache.kafka.common.KafkaException: Trying to roll a new log segment for topic partition test-1 with start offset 201 while it already exists.
at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1498)
at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1465)
at kafka.log.Log.maybeHandleIOException(Log.scala:1842)
at kafka.log.Log.roll(Log.scala:1465)
at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1450)
at kafka.log.Log$$anonfun$append$2.apply(Log.scala:858)
at kafka.log.Log$$anonfun$append$2.apply(Log.scala:752)
at kafka.log.Log.maybeHandleIOException(Log.scala:1842)
at kafka.log.Log.append(Log.scala:752)
at kafka.log.Log.appendAsFollower(Log.scala:733)
at kafka.cluster.Partition$$anonfun$doAppendRecordsToFollowerOrFutureReplica$1.apply(Partition.scala:606)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:257)
at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:593)
at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:613)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:127)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:40)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:186)
... 13 more
2021-07-27 17:41:11,550 INFO kafka.server.ReplicaFetcherThread: [ReplicaFetcher replicaId=193, leaderId=192, fetcherId=2] Stopped

The call path is as follows:

AbstractFetcherThread.processFetchRequest
→ ReplicaFetcherThread.processPartitionData
→ Partition.appendRecordsToFollowerOrFutureReplica
→ Log.maybeRoll
→ Log.roll

The relevant part of Log.roll on the 2.0 branch looks roughly like this (abridged sketch below, not a verbatim quote):
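
// Abridged paraphrase of kafka.log.Log#roll in 2.0.x (unrelated code omitted; not verbatim)
lock synchronized {
  val newOffset = math.max(expectedNextOffset, logEndOffset)
  val logFile = Log.logFile(dir, newOffset)
  val offsetIdxFile = offsetIndexFile(dir, newOffset)
  val timeIdxFile = timeIndexFile(dir, newOffset)
  val txnIdxFile = transactionIndexFile(dir, newOffset)

  // 1) if files for the new base offset already exist on disk, warn and delete them first
  for (file <- List(logFile, offsetIdxFile, timeIdxFile, txnIdxFile) if file.exists) {
    warn(s"Newly rolled segment file ${file.getAbsolutePath} already exists; deleting it first")
    Files.delete(file.toPath)
  }

  // ... snapshot producer state, mark the previous active segment inactive ...

  // 2) open the new segment and register it in the in-memory segments map;
  //    if a segment with the same base offset is still registered there, roll throws
  val segment = LogSegment.open(dir, baseOffset = newOffset, config, time = time)
  val prev = addSegment(segment)
  if (prev != null)
    throw new KafkaException(s"Trying to roll a new log segment for topic partition $name " +
      s"with start offset $newOffset while it already exists.")
  // ...
}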

From the log, there were first three delete operations:

Newly rolled segment file /data/kafka/test-1/00000000000000000201.log already exists; deleting it first
Newly rolled segment file /data/kafka/test-1/00000000000000000201.index already exists; deleting it first
Newly rolled segment file /data/kafka/test-1/00000000000000000201.timeindex already exists; deleting it first

followed by the error:

Trying to roll a new log segment for topic partition test-1 with start offset 201 while it already exists.

This suggests the deletion did not take effect; the suspicion was that the files were open and being written to at the time, which caused the delete to fail.

The files on the server looked like this:

# ls /data/kafka/test-1/ -l
total 4
-rw-r--r-- 1 kafka kafka 10485760 Jul 27 17:41 00000000000000000201.index
-rw-r--r-- 1 kafka kafka 0 Jul 27 17:41 00000000000000000201.log
-rw-r--r-- 1 kafka kafka 10485756 Jul 27 17:41 00000000000000000201.timeindex
-rw-r--r-- 1 kafka kafka 11 Jul 27 17:41 leader-epoch-checkpoint

Why did a single ReplicaFetcherThread error cause 25 partitions to drop broker 193?

ReplicaFetcherManager's parent class AbstractFetcherManager has the following field:

private[server] val fetcherThreadMap = new mutable.HashMap[BrokerIdAndFetcherId, AbstractFetcherThread]

This map stores the ReplicaFetcherThreads keyed by (broker id, fetcher id). The number of fetcher threads per source broker is controlled by num.replica.fetchers, and each partition is hashed onto one of them, so a single fetcher thread is responsible for replicating many partitions from its source broker (with the default of one fetcher per broker, all of them). When the ReplicaFetcherThread hit the error above, the whole thread exited, which stopped replication for every partition that thread owned on the source broker, so all of those partitions dropped 193 from their ISR at the same time. A sketch of the partition-to-fetcher assignment follows.
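
The assignment of partitions to fetcher threads looks roughly like this (paraphrased from AbstractFetcherManager in the 2.0 branch, not verbatim; numFetchersPerBroker comes from num.replica.fetchers):

// Each (source broker, fetcher id) pair maps to one ReplicaFetcherThread in fetcherThreadMap;
// a partition is assigned to a fetcher id by hashing, so one thread replicates many partitions.
private def getFetcherId(topic: String, partitionId: Int): Int =
  Utils.abs(31 * topic.hashCode() + partitionId) % numFetchersPerBroker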

Why did the deletion fail at that time?

It turns out the bug had already been reported in the Apache Kafka JIRA:

https://issues.apache.org/jira/browse/KAFKA-6388

and it has since been fixed:

https://github.com/apache/kafka/pull/6006/commits/f39a9f533f7bedf28dbaca7644b99808e163ea4d

The fix removes the prev check and the code that throws this exception in Log.roll.

Resolution: upgrade Kafka from 2.0 to 2.1; the bug is fixed as of 2.1.
