Kafka Partitioning Strategies

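Why does Kafka partition data?

It makes scaling within a cluster easy: each partition can be sized to fit the machine that hosts it, and a topic can consist of multiple partitions, so the cluster as a whole can accommodate data of any size.
It raises concurrency, because reads and writes are done per partition.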

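Producer partitioning strategy

The producer wraps each message in a ProducerRecord object, and the partition is then chosen as follows:

- If the record specifies a partition, that value is used directly and the partitioner is not invoked.
- If no partition is specified but the record has a key, the hash of the key modulo the topic's partition count gives the partition.
- If the record has neither a partition nor a key, Kafka uses the sticky partitioner: it picks a partition at random and keeps using it for as long as possible. Each partition has its own batch. Once that partition's batch is full (batch.size, 16 KB by default) or the linger time has expired (linger.ms, 0 ms by default), Kafka picks another partition at random.

Note:

ProducerRecord is the key/value pair that is sent to the Kafka broker.
Its fields are Topic, PartitionID (optional), Key (optional), and Value, and each combination has a matching constructor.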
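
The three cases above can be sketched in a few lines of plain Java. This is only an illustration, not the Kafka client code: the class PartitionChoiceSketch and the method choosePartition are hypothetical names, Arrays.hashCode stands in for the murmur2 hash that Kafka uses, and a single static field replaces the per-topic sticky cache.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class PartitionChoiceSketch {
    // Hypothetical stand-in for the sticky partition; a real producer tracks one per topic.
    private static int stickyPartition = -1;

    /** explicitPartition and key may be null, mirroring the optional ProducerRecord fields. */
    static int choosePartition(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            // Case 1: the partition is given, the partitioner is bypassed.
            return explicitPartition;
        }
        if (key != null) {
            // Case 2: hash(key) mod partition count.
            // Assumption: Arrays.hashCode replaces Kafka's murmur2 to keep the sketch self-contained.
            int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
            return (hash & 0x7fffffff) % numPartitions;
        }
        // Case 3: no partition and no key, so stick to one randomly chosen partition.
        if (stickyPartition < 0) {
            stickyPartition = ThreadLocalRandom.current().nextInt(numPartitions);
        }
        return stickyPartition;
    }

    public static void main(String[] args) {
        System.out.println(choosePartition(2, "user-1", 6));       // explicit partition wins
        System.out.println(choosePartition(null, "user-1", 6));    // same key, same partition
        System.out.println(choosePartition(null, null, 6));        // sticky partition
    }
}
```

Because the keyed case depends only on the key and the partition count, records with the same key land in the same partition as long as the partition count does not change.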

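Earlier versions:

Round-robin partitioning. In the current version this is RoundRobinPartitioner; in earlier versions it was what DefaultPartitioner did for records without a key.

The problem with round-robin partitioning:

Records are appended to the batches in rotation, so the linger time can expire before any single batch is full. The data ends up spread across many batches, and each batch is poorly utilized.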
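
The effect can be made concrete with a small count. The sketch below is a simplified model, not Kafka code: it assumes a batch holds a fixed number of records (real batches are limited in bytes by batch.size) and that all records arrive within one linger window. BatchFillSketch and its methods are hypothetical names.

```java
public class BatchFillSketch {
    /** Number of batches sent when records rotate across all partitions. */
    static int roundRobinBatches(int records, int numPartitions, int batchCapacity) {
        int[] perPartition = new int[numPartitions];
        for (int i = 0; i < records; i++) {
            perPartition[i % numPartitions]++;
        }
        int batches = 0;
        for (int count : perPartition) {
            // Ceiling division: a partly filled batch is still sent when the window ends.
            batches += (count + batchCapacity - 1) / batchCapacity;
        }
        return batches;
    }

    /** Number of batches sent when records stay on one partition until its batch is full. */
    static int stickyBatches(int records, int batchCapacity) {
        return (records + batchCapacity - 1) / batchCapacity;
    }

    public static void main(String[] args) {
        // 8 records, 4 partitions, room for 8 records per batch.
        System.out.println(roundRobinBatches(8, 4, 8)); // 4 batches holding 2 records each
        System.out.println(stickyBatches(8, 8));        // 1 full batch
    }
}
```

Under these assumptions round-robin sends four batches that are each a quarter full, while sticking to one partition sends a single full batch.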

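Current version:

Sticky partitioning (introduced in Kafka 2.4). In the current version this is what DefaultPartitioner, the default partitioner, does.

It first checks whether the key is null. If the key is not null, the partition number is the hash of the key modulo the number of partitions; if the key is null, the sticky partitioner is used.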

```
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    if (keyBytes == null) {
        // No key: let the sticky partitioner choose the partition.
        return stickyPartitionCache.partition(topic, cluster);
    }
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    // hash the keyBytes to choose a partition
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
```

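In the sticky partitioner, the cached partition part is looked up first. If part is null, this is the first call and nothing is cached yet, so nextPartition is called with prevPartition set to -1, meaning there is no previous partition. In practice, once a batch is full, a new partition is chosen from the remaining partitions, and because a partition was already in use, prevPartition is set to the number of that previous partition.
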
```
public int partition(String topic, Cluster cluster) {
    Integer part = indexCache.get(topic);
    if (part == null) {
        return nextPartition(topic, cluster, -1);
    }
    return part;
}
```

nextPartition first reads the currently cached partition. Since it is not known whether one exists yet, the method checks for null.

```
public int nextPartition(String topic, Cluster cluster, int prevPartition) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    Integer oldPart = indexCache.get(topic);
    Integer newPart = oldPart;
    // Check that the current sticky partition for the topic is either not set or that the partition that
    // triggered the new batch matches the sticky partition that needs to be changed.
    if (oldPart == null || oldPart == prevPartition) {
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() < 1) {
            Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
            newPart = random % partitions.size();
        } else if (availablePartitions.size() == 1) {
            newPart = availablePartitions.get(0).partition();
        } else {
            while (newPart == null || newPart.equals(oldPart)) {
                Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                newPart = availablePartitions.get(random % availablePartitions.size()).partition();
            }
        }
        // Only change the sticky partition if it is null or prevPartition matches the current sticky partition.
        if (oldPart == null) {
            indexCache.putIfAbsent(topic, newPart);
        } else {
            indexCache.replace(topic, prevPartition, newPart);
        }
        return indexCache.get(topic);
    }
    return indexCache.get(topic);
}
```


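```
while (newPart == null || newPart.equals(oldPart)) {
    Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
    newPart = availablePartitions.get(random % availablePartitions.size()).partition();
}
```

Note: the first time, a partition is picked at random and the producer sticks to it. When that partition is done (its batch is full or the time limit has been exceeded), another partition is picked at random from the remaining ones and the producer sticks to that one, and so on.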
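
To try the stick-then-switch behavior without a Kafka cluster, the cache logic can be reduced to a standalone sketch. It is a simplification under stated assumptions, not the Kafka source: StickyCacheSketch is a hypothetical class, the Cluster argument is replaced by a plain partition count, and every partition is assumed to be available.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;

public class StickyCacheSketch {
    // topic -> partition currently being "stuck" to
    private final ConcurrentMap<String, Integer> indexCache = new ConcurrentHashMap<>();

    /** Returns the cached partition, picking one at random on first use. */
    public int partition(String topic, int numPartitions) {
        Integer part = indexCache.get(topic);
        if (part == null) {
            return nextPartition(topic, numPartitions, -1); // -1: no previous partition
        }
        return part;
    }

    /** Called when the batch for prevPartition is done; switches to a different partition. */
    public int nextPartition(String topic, int numPartitions, int prevPartition) {
        Integer oldPart = indexCache.get(topic);
        // Only switch if nothing is cached yet, or if the caller's partition is still the sticky one.
        if (oldPart == null || oldPart == prevPartition) {
            int newPart;
            do {
                newPart = ThreadLocalRandom.current().nextInt(numPartitions);
            } while (numPartitions > 1 && oldPart != null && newPart == oldPart);
            if (oldPart == null) {
                indexCache.putIfAbsent(topic, newPart);
            } else {
                indexCache.replace(topic, prevPartition, newPart);
            }
        }
        return indexCache.get(topic);
    }

    public static void main(String[] args) {
        StickyCacheSketch cache = new StickyCacheSketch();
        int first = cache.partition("demo", 4);
        int again = cache.partition("demo", 4);            // same as first
        int next = cache.nextPartition("demo", 4, first);  // differs from first
        System.out.println(first + " " + again + " " + next);
    }
}
```

The check against prevPartition matters when several threads finish a batch at about the same time: only the first caller moves the sticky partition, and later callers with a stale prevPartition simply get the partition that is already cached.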

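Benefits of producer partitioning:

It makes scaling within the cluster easy: when server nodes are added, the partitions on the existing servers do not have to be reassigned, and new partitions can be added to meet demand.
Reading and writing per partition increases throughput and concurrency.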

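Consumer partition assignment strategies

RoundRobin
- Benefit of round-robin assignment: load balancing. The number of partitions assigned to any two consumers differs by at most 1.
- Problem:
  - Each partition of each topic is represented by a TopicPartition object. In the implementation described here, the objects are sorted by their hashCode and then handed out to the consumers in rotation. If the consumers in a group subscribe to different topics, a consumer can be handed a partition of a topic it never subscribed to. (The Java client's RoundRobinAssignor skips consumers that are not subscribed to a topic, which avoids this but can leave the assignment unbalanced.) Round-robin is therefore best used only when every consumer in the group subscribes to the same topics.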
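
The "differs by at most 1" property is easy to see in a minimal sketch. It assumes all consumers subscribe to the same single topic and ignores the sorting step; RoundRobinAssignSketch and assign are hypothetical names, not part of the Kafka client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinAssignSketch {
    /** Hands partitions 0..numPartitions-1 to the consumers in rotation. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());
            result.get(owner).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 7 partitions, 3 consumers
        System.out.println(assign(Arrays.asList("C0", "C1", "C2"), 7));
        // prints {C0=[0, 3, 6], C1=[1, 4], C2=[2, 5]}
    }
}
```
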
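Range
- It does not have the round-robin problem of consumers being handed topics they did not subscribe to.
- Problem:
  - Ranges are computed separately for each topic. When the consumers in a group subscribe to two or more topics whose partition counts do not divide evenly among the consumers, the same leading consumers receive the extra partition of every topic. The load becomes uneven, and with large data volumes this leads to data skew.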
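
The arithmetic behind the skew can be sketched as follows. The example assumes the consumers are already sorted and identified by index; RangeAssignSketch and rangeFor are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeAssignSketch {
    /** Partitions of one topic that the consumer at consumerIndex receives. */
    static List<Integer> rangeFor(int consumerIndex, int numConsumers, int numPartitions) {
        int perConsumer = numPartitions / numConsumers;
        int extra = numPartitions % numConsumers; // the first 'extra' consumers get one more
        int start = perConsumer * consumerIndex + Math.min(consumerIndex, extra);
        int length = perConsumer + (consumerIndex < extra ? 1 : 0);
        List<Integer> parts = new ArrayList<>();
        for (int p = start; p < start + length; p++) {
            parts.add(p);
        }
        return parts;
    }

    public static void main(String[] args) {
        // One topic with 7 partitions, 3 consumers
        for (int i = 0; i < 3; i++) {
            System.out.println("C" + i + " " + rangeFor(i, 3, 7));
        }
        // prints C0 [0, 1, 2], then C1 [3, 4], then C2 [5, 6]
    }
}
```

With two such topics C0 would own 6 partitions while C1 and C2 own 4 each, and the gap grows by one for every additional topic with the same shape.
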
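Sticky
- Sticky assignment builds on round-robin assignment and has two goals:
  1. The partition assignment should be as balanced as possible.
  2. The partition assignment should stay as close as possible to the previous assignment.
  Note: when the two goals conflict, the first takes priority over the second.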

posted @ 2021-05-26 22:26 yuexiuping
