Spring RabbitMQ consumer reconnect: source code analysis

 

I. Background

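Late one night, the RabbitMQ server cluster in production went down and I received a WeCom alert. I assumed the RabbitMQ server side had disaster recovery and high-availability mechanisms and that the business would not be affected, so I did not look into the alert.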

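In the morning, after I arrived at the office, ops told me that a queue consumed by one of my services had a message backlog, which was unlike its normal behavior. I suspected a connection to the RabbitMQ outage during the night. The service logs showed that the service was not consuming from RabbitMQ, and Grafana showed a consumer count of 0 on some of its queues while other queues had their normal consumer count.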

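Question: abstractly, the service uses RabbitMQ as A->B->C, where A, B and C are queues. After the RabbitMQ server went down and recovered, the service had 0 consumers on queues A and B but the normal number of consumers on queue C. How does the client-side consumer reconnect mechanism of Spring RabbitMQ work, and why the difference?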

II. Source code analysis

With this question in mind, I went through the service's business code and read the Spring RabbitMQ (spring-rabbit) source.

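Conclusion: the service consumes queues A, B and C in different ways, and that difference explains why consumer reconnect behaved differently after the RabbitMQ server recovered. A and B use the @RabbitListener annotation, backed by SimpleMessageListenerContainer; C uses a class that implements the MessageListener interface, backed by DirectMessageListenerContainer.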

Searching the logs by keyword:

  • SimpleMessageListenerContainer: a limited number of reconnect attempts. These keywords appear only a few times in the logs: "Consumer threw missing queues exception, fatal=true", "Stopping container from aborted consumer", "Restarting Consumer XXX"
  • DirectMessageListenerContainer: reconnect is retried continuously. This keyword is logged over and over: "Queue not present, scheduling consumer XXX for queue XXX for restart"

Production runs Spring Boot 2.4.6 with spring-rabbit 2.3.7; the source shown below is from Spring Boot 3.1.5 with spring-rabbit 3.0.10.

1. SimpleMessageListenerContainer

The core consumer logic of SimpleMessageListenerContainer lives in the inner class org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.AsyncMessageProcessingConsumer. AsyncMessageProcessingConsumer is a Runnable whose run method contains mostly control logic; the actual consumption is done by its BlockingQueueConsumer field. Source excerpt:

#org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.AsyncMessageProcessingConsumer
private final class AsyncMessageProcessingConsumer implements Runnable {

private final BlockingQueueConsumer consumer; // does the actual MQ consumption; other fields omitted

@Override // NOSONAR - complexity - many catch blocks
public void run() { // NOSONAR - line count
if (!isActive()) {
this.start.countDown();
return;
}

boolean aborted = false; // important flag, set to true on fatal exceptions

this.consumer.setLocallyTransacted(isChannelLocallyTransacted());

String routingLookupKey = getRoutingLookupKey();
if (routingLookupKey != null) {
SimpleResourceHolder.bind(getRoutingConnectionFactory(), routingLookupKey); // NOSONAR both never null
}

if (this.consumer.getQueueCount() < 1) {
if (logger.isDebugEnabled()) {
logger.debug("Consumer stopping; no queues for " + this.consumer);
}
SimpleMessageListenerContainer.this.cancellationLock.release(this.consumer);
if (getApplicationEventPublisher() != null) {
getApplicationEventPublisher().publishEvent(
new AsyncConsumerStoppedEvent(SimpleMessageListenerContainer.this, this.consumer));
}
this.start.countDown();
return;
}

try {
initialize();
while (isActive(this.consumer) || this.consumer.hasDelivery() || !this.consumer.cancelled()) {
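mainLoop(); // conditional infinite loop: mainLoop is where the BlockingQueueConsumer actually consumes RabbitMQ messages. mainLoop itself is not the key to reconnect; the key is the while condition: the loop exits when an exception is thrown or the condition no longer holds.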
}
}
catch (InterruptedException e) {
logger.debug("Consumer thread interrupted, processing stopped.");
Thread.currentThread().interrupt();
aborted = true;
publishConsumerFailedEvent("Consumer thread interrupted, processing stopped", true, e);
}
catch (QueuesNotAvailableException ex) {
logger.error("Consumer threw missing queues exception, fatal=" + isMissingQueuesFatal(), ex); // this is the error seen in the logs: the RabbitMQ server was unavailable; with fatal=true the aborted flag is set to true just below.
if (isMissingQueuesFatal()) {
this.startupException = ex;
// Fatal, but no point re-throwing, so just abort.
aborted = true;
}
publishConsumerFailedEvent("Consumer queue(s) not available", aborted, ex);
}
catch (FatalListenerStartupException ex) {
logger.error("Consumer received fatal exception on startup", ex);
this.startupException = ex;
// Fatal, but no point re-throwing, so just abort.
aborted = true;
publishConsumerFailedEvent("Consumer received fatal exception on startup", true, ex);
}
catch (FatalListenerExecutionException ex) { // NOSONAR exception as flow control
logger.error("Consumer received fatal exception during processing", ex);
// Fatal, but no point re-throwing, so just abort.
aborted = true;
publishConsumerFailedEvent("Consumer received fatal exception during processing", true, ex);
}
catch (PossibleAuthenticationFailureException ex) {
logger.error("Consumer received fatal=" + isPossibleAuthenticationFailureFatal() +
" exception during processing", ex);
if (isPossibleAuthenticationFailureFatal()) {
this.startupException =
new FatalListenerStartupException("Authentication failure",
new AmqpAuthenticationException(ex));
// Fatal, but no point re-throwing, so just abort.
aborted = true;
}
publishConsumerFailedEvent("Consumer received PossibleAuthenticationFailure during startup", aborted, ex);
}
catch (ShutdownSignalException e) {
if (RabbitUtils.isNormalShutdown(e)) {
if (logger.isDebugEnabled()) {
logger.debug("Consumer received Shutdown Signal, processing stopped: " + e.getMessage());
}
}
else {
logConsumerException(e);
}
}
catch (AmqpIOException e) {
if (e.getCause() instanceof IOException && e.getCause().getCause() instanceof ShutdownSignalException
&& e.getCause().getCause().getMessage().contains("in exclusive use")) {
getExclusiveConsumerExceptionLogger().log(logger,
"Exclusive consumer failure", e.getCause().getCause());
publishConsumerFailedEvent("Consumer raised exception, attempting restart", false, e);
}
else {
logConsumerException(e);
}
}
catch (Error e) { //NOSONAR
logger.error("Consumer thread error, thread abort.", e);
publishConsumerFailedEvent("Consumer threw an Error", true, e);
getJavaLangErrorHandler().handle(e);
aborted = true;
}
catch (Throwable t) { //NOSONAR
// by now, it must be an exception
if (isActive()) {
logConsumerException(t);
}
}
finally {
if (getTransactionManager() != null) {
ConsumerChannelRegistry.unRegisterConsumerChannel();
}
}

// In all cases count down to allow container to progress beyond startup
this.start.countDown();

killOrRestart(aborted); // reached once this AsyncMessageProcessingConsumer has left the loop above; aborted=true after a fatal exception.

if (routingLookupKey != null) {
SimpleResourceHolder.unbind(getRoutingConnectionFactory()); // NOSONAR never null here
}
}

private void killOrRestart(boolean aborted) {
if (!isActive(this.consumer) || aborted) { // note the condition: either side being true enters this branch; with aborted=true, stop() is called below
logger.debug("Cancelling " + this.consumer);
try {
this.consumer.stop();
SimpleMessageListenerContainer.this.cancellationLock.release(this.consumer);
if (getApplicationEventPublisher() != null) {
getApplicationEventPublisher().publishEvent(
new AsyncConsumerStoppedEvent(SimpleMessageListenerContainer.this, this.consumer));
}
}
catch (AmqpException e) {
logger.info("Could not cancel message consumer", e);
}
if (aborted && SimpleMessageListenerContainer.this.containerStoppingForAbort
.compareAndSet(null, Thread.currentThread())) {
logger.error("Stopping container from aborted consumer");
stop(); // the key call: stop() is inherited from the parent class AbstractMessageListenerContainer
SimpleMessageListenerContainer.this.containerStoppingForAbort.set(null);
ListenerContainerConsumerFailedEvent event = null;
do {
try {
event = SimpleMessageListenerContainer.this.abortEvents.poll(ABORT_EVENT_WAIT_SECONDS,
TimeUnit.SECONDS);
if (event != null) {
SimpleMessageListenerContainer.this.publishConsumerFailedEvent(
event.getReason(), event.isFatal(), event.getThrowable());
}
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
while (event != null);
}
}
else { // "!isActive(this.consumer) || aborted" is false: restart the consumer, i.e. the BlockingQueueConsumer
logger.info("Restarting " + this.consumer); // the "Restarting" keyword seen in the logs
restart(this.consumer);
}
}

}

The stop method of AbstractMessageListenerContainer calls shutdown, and shutdown sets the active field of AbstractMessageListenerContainer to false.

#org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer	
public void shutdown(@Nullable Runnable callback) {
synchronized (this.lifecycleMonitor) {
if (!isActive()) {
logger.debug("Shutdown ignored - container is not active already");
this.lifecycleMonitor.notifyAll();
if (callback != null) {
callback.run();
}
return;
}
this.active = false; // set active to false
this.lifecycleMonitor.notifyAll();
}

logger.debug("Shutting down Rabbit listener container");

// Shut down the invokers.
try {
shutdownAndWaitOrCallback(callback);
}
catch (Exception ex) {
throw convertRabbitAccessException(ex);
}
finally {
setNotRunning();
}
}

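As for the other consumers: the while loop in AsyncMessageProcessingConsumer.run checks isActive(consumer), which in turn reads the active field of AbstractMessageListenerContainer. Judging from the code above, once active is false the loop ends and killOrRestart takes its first branch (because !isActive(this.consumer) is true), so those consumers are cancelled rather than restarted. The else branch, which logs "Restarting ..." and restarts the consumer, is only reached while the container is still active and the consumer was not aborted, for example right after the connection drops. (See the run method of AsyncMessageProcessingConsumer.)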

#org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer
private boolean isActive(BlockingQueueConsumer consumer) {
boolean consumerActive;
synchronized (this.consumersMonitor) {
consumerActive = this.consumers != null && this.consumers.contains(consumer);
}
return consumerActive && this.isActive(); // this.isActive() is the parent class AbstractMessageListenerContainer's method and checks its active field
}

In summary, the reconnect mechanism of SimpleMessageListenerContainer retries only a limited number of times, which matches what the production logs show.
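The branch taken in killOrRestart can be condensed into a small pure function. The sketch below is a stand-alone model, not Spring code: the class KillOrRestartModel, the enum Action and the method decide are hypothetical names, and the three booleans stand in for isActive(consumer), the aborted flag and the containerStoppingForAbort compare-and-set.

```java
// Stand-alone model of the branch logic in killOrRestart (hypothetical names, not Spring API).
final class KillOrRestartModel {

    enum Action { CANCEL_CONSUMER, CANCEL_AND_STOP_CONTAINER, RESTART_CONSUMER }

    /**
     * @param consumerActive models isActive(consumer): container active and consumer still registered
     * @param aborted        models the aborted flag set in the catch blocks of run()
     * @param firstToAbort   models containerStoppingForAbort.compareAndSet(null, currentThread)
     */
    static Action decide(boolean consumerActive, boolean aborted, boolean firstToAbort) {
        if (!consumerActive || aborted) {
            // The consumer is always cancelled on this branch; only the first aborted
            // consumer additionally stops the whole container.
            return (aborted && firstToAbort)
                    ? Action.CANCEL_AND_STOP_CONTAINER
                    : Action.CANCEL_CONSUMER;
        }
        // Container still active and no fatal error: a new consumer replaces this one.
        return Action.RESTART_CONSUMER;
    }
}
```

Read against the outage: a dropped connection while the container is active gives decide(true, false, false), a restart. If the restarted consumer then fails with a fatal missing-queues error, decide(true, true, true) stops the container, and from then on every remaining consumer sees decide(false, false, false) and is only cancelled. Nothing in this path schedules another attempt, which would explain why the keywords show up only a few times.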

2. DirectMessageListenerContainer

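The core consumer logic of DirectMessageListenerContainer lives in the class org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer. The central method is actualStart. It starts the consumers through startConsumers, but startConsumers is not the key to the reconnect mechanism; the key is startMonitor, which starts a background task that keeps checking consumer state and reconnects consumers when something is wrong. Source excerpt: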

# org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer
protected void actualStart() {
this.aborted = false;
this.hasStopped = false;
if (getPrefetchCount() < this.messagesPerAck) {
setPrefetchCount(this.messagesPerAck);
}
super.doStart();
final String[] queueNames = getQueueNames();
checkMissingQueues(queueNames);
checkConnect();
long idleEventInterval = getIdleEventInterval();
if (this.taskScheduler == null) {
afterPropertiesSet();
}
if (idleEventInterval > 0 && this.monitorInterval > idleEventInterval) {
this.monitorInterval = idleEventInterval / 2;
}
if (getFailedDeclarationRetryInterval() < this.monitorInterval) {
this.monitorInterval = getFailedDeclarationRetryInterval();
}
final Map<String, Queue> namesToQueues = getQueueNamesToQueues();
this.lastRestartAttempt = System.currentTimeMillis();
startMonitor(idleEventInterval, namesToQueues); // core of consumer reconnect
if (queueNames.length > 0) {
doRedeclareElementsIfNecessary();
getTaskExecutor().execute(() -> { // NOSONAR never null here
startConsumers(queueNames); // starts the consumers; not the key to the reconnect mechanism
});
}
else {
this.started = true;
this.startedLatch.countDown();
}
if (logger.isInfoEnabled()) {
this.logger.info("Container initialized for queues: " + Arrays.asList(queueNames));
}
}

The source of startMonitor is as follows:

# org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer
private void startMonitor(long idleEventInterval, final Map<String, Queue> namesToQueues) {
this.consumerMonitorTask = this.taskScheduler.scheduleAtFixedRate(() -> { // background task run at a fixed rate
long now = System.currentTimeMillis();
checkIdle(idleEventInterval, now);
checkConsumers(now); // check whether each consumer is usable; unusable ones are put into consumersToRestart
if (this.lastRestartAttempt + getFailedDeclarationRetryInterval() < now) {
synchronized (this.consumersMonitor) {
if (this.started) {
List<SimpleConsumer> restartableConsumers = new ArrayList<>(this.consumersToRestart); // consumersToRestart: the consumers that need to reconnect
this.consumersToRestart.clear();
if (restartableConsumers.size() > 0) {
doRedeclareElementsIfNecessary();
}
Iterator<SimpleConsumer> iterator = restartableConsumers.iterator();
while (iterator.hasNext()) {
SimpleConsumer consumer = iterator.next();
iterator.remove();
if (DirectMessageListenerContainer.this.removedQueues.contains(consumer.getQueue())) {
if (this.logger.isDebugEnabled()) {
this.logger.debug("Skipping restart of consumer, queue removed " + consumer);
}
continue;
}
if (this.logger.isDebugEnabled()) {
this.logger.debug("Attempting to restart consumer " + consumer);
}
if (!restartConsumer(namesToQueues, restartableConsumers, consumer)) { // reconnect the consumer
break;
}
}
this.lastRestartAttempt = now;
}
}
}
processMonitorTask();
}, Duration.ofMillis(this.monitorInterval));
}

private void checkConsumers(long now) {
final List<SimpleConsumer> consumersToCancel;
synchronized (this.consumersMonitor) {
consumersToCancel = this.consumers.stream()
.filter(consumer -> {
boolean open = consumer.getChannel().isOpen() && !consumer.isAckFailed() // check whether the consumer is usable
&& !consumer.targetChanged();
if (open && this.messagesPerAck > 1) {
try {
consumer.ackIfNecessary(now);
}
catch (Exception e) {
this.logger.error("Exception while sending delayed ack", e);
}
}
return !open;
})
.collect(Collectors.toList());
}
consumersToCancel
.forEach(consumer -> {
try {
RabbitUtils.closeMessageConsumer(consumer.getChannel(),
Collections.singletonList(consumer.getConsumerTag()), isChannelTransacted());
}
catch (Exception e) {
if (logger.isDebugEnabled()) {
logger.debug("Error closing consumer " + consumer, e);
}
}
this.logger.error("Consumer canceled - channel closed " + consumer);
consumer.cancelConsumer("Consumer " + consumer + " channel closed"); // unusable consumer: call SimpleConsumer.cancelConsumer
});
}

SimpleConsumer's cancelConsumer method passes the current SimpleConsumer to DirectMessageListenerContainer's addConsumerToRestart method, which adds it to the consumersToRestart collection.

# org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer.SimpleConsumer
void cancelConsumer(final String eventMessage) {
publishConsumerFailedEvent(eventMessage, true, null);
synchronized (DirectMessageListenerContainer.this.consumersMonitor) {
List<SimpleConsumer> list = DirectMessageListenerContainer.this.consumersByQueue.get(this.queue);
if (list != null) {
list.remove(this);
}
DirectMessageListenerContainer.this.consumers.remove(this);
addConsumerToRestart(this); // add this SimpleConsumer to DirectMessageListenerContainer's consumersToRestart
}
finalizeConsumer();
}

The addConsumerToRestart method of DirectMessageListenerContainer:

# org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer
private void addConsumerToRestart(SimpleConsumer consumer) {
this.consumersToRestart.add(consumer);
if (this.logger.isTraceEnabled()) {
this.logger.trace("Consumers to restart now: " + this.consumersToRestart);
}
}

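In summary, the reconnect mechanism of DirectMessageListenerContainer is backed by a background task, so it retries an unlimited number of times, which matches what the production logs show.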
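The monitor pattern described above can be illustrated with a simplified, stand-alone model. This is only a sketch under assumptions: MonitorModel, tick and the brokerUp supplier are hypothetical names, a single boolean stands in for "channel open and queue consumable", and the real container's throttling and locking are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BooleanSupplier;

// Stand-alone model of the monitor loop (hypothetical names, not Spring API).
final class MonitorModel {

    private final List<String> consumers = new ArrayList<>();         // queues with a live consumer
    private final Deque<String> consumersToRestart = new ArrayDeque<>();
    private final BooleanSupplier brokerUp;                            // assumption: one flag for broker health
    private int restartAttempts;

    MonitorModel(List<String> queues, BooleanSupplier brokerUp) {
        this.consumers.addAll(queues);
        this.brokerUp = brokerUp;
    }

    /** One run of the periodic task: detect dead consumers, then retry the pending ones. */
    void tick() {
        if (!brokerUp.getAsBoolean()) {
            // Models checkConsumers + cancelConsumer: dead consumers move to the restart list.
            consumersToRestart.addAll(consumers);
            consumers.clear();
        }
        List<String> pending = new ArrayList<>(consumersToRestart);
        consumersToRestart.clear();
        for (String queue : pending) {
            restartAttempts++;
            if (brokerUp.getAsBoolean()) {
                consumers.add(queue);           // restart succeeded
            } else {
                consumersToRestart.add(queue);  // failed: stays queued for the next tick
            }
        }
    }

    int liveConsumers() { return consumers.size(); }
    int pendingRestarts() { return consumersToRestart.size(); }
    int restartAttempts() { return restartAttempts; }
}
```

Because a failed restart puts the consumer back on the list and the task keeps running, the number of attempts is bounded only by the length of the outage. In the real container the task is scheduled through taskScheduler.scheduleAtFixedRate with monitorInterval, and restart attempts are additionally spaced by getFailedDeclarationRetryInterval(), as the startMonitor source shows.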

3. Summary

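The reconnect mechanism of DirectMessageListenerContainer is more complete than that of SimpleMessageListenerContainer. If the RabbitMQ server only drops briefly, SimpleMessageListenerContainer can cope; if the outage lasts longer, its reconnect mechanism gives up, whereas the DirectMessageListenerContainer implementation is more reasonable and more complete.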

III. Fix

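Following the principle of fixing the problem with the smallest change and the lowest cost, the plan is to change the containerFactory of the @RabbitListener annotation. According to the javadoc, containerFactory refers to an org.springframework.amqp.rabbit.listener.RabbitListenerContainerFactory, which has several implementations.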

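It is enough to choose the org.springframework.amqp.rabbit.config.DirectRabbitListenerContainerFactory implementation. Spring Boot has thoughtfully left a hook for this: setting the property spring.rabbitmq.listener.type=direct switches the containerFactory to DirectRabbitListenerContainerFactory, with no code change.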
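For reference, the switch would look roughly like this in application.properties. The consumers-per-queue line is optional and its value is purely illustrative; both keys are Spring Boot listener properties as far as I know, but they are worth checking against the Boot version in use.

```properties
# Use DirectMessageListenerContainer for @RabbitListener endpoints
spring.rabbitmq.listener.type=direct
# Optional: consumers per queue for the direct container (illustrative value)
spring.rabbitmq.listener.direct.consumers-per-queue=2
```

This only affects listeners that use the auto-configured default container factory; a listener that names its own containerFactory bean keeps whatever container type that bean creates.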

Also note that after switching the containerFactory, any concurrency set on the @RabbitListener annotation may need to be adapted, for the following reason:

/**
* Set the concurrency of the listener container for this listener. Overrides the
* default set by the listener container factory. Maps to the concurrency setting of
* the container type.
* <p>For a
* {@link org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer
* SimpleMessageListenerContainer} if this value is a simple integer, it sets a fixed
* number of consumers in the {@code concurrentConsumers} property. If it is a string
* with the form {@code "m-n"}, the {@code concurrentConsumers} is set to {@code m}
* and the {@code maxConcurrentConsumers} is set to {@code n}.
* <p>For a
* {@link org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer
* DirectMessageListenerContainer} it sets the {@code consumersPerQueue} property.
* @return the concurrency.
* @since 2.0
*/
String concurrency() default ""; // the concurrency of SimpleMessageListenerContainer is adaptive; previously ...
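To make the difference in the javadoc concrete, here is a stand-alone illustration of how one concurrency string would be read for each container type. ConcurrencySketch and its methods are hypothetical names, not Spring API. Representing a plain integer as a pair with min equal to max is a modelling choice, and rejecting an "m-n" range for the direct container is an assumption based on consumersPerQueue being a single number.

```java
// Stand-alone illustration of the javadoc above (hypothetical names, not Spring API).
final class ConcurrencySketch {

    /** Simple container: "5" -> {5, 5}; "2-8" -> {2, 8}, read as {concurrentConsumers, maxConcurrentConsumers}. */
    static int[] forSimpleContainer(String concurrency) {
        String value = concurrency.trim();
        int dash = value.indexOf('-');
        if (dash < 0) {
            int fixed = Integer.parseInt(value);
            return new int[] { fixed, fixed };   // fixed number of consumers in this model
        }
        int min = Integer.parseInt(value.substring(0, dash).trim());
        int max = Integer.parseInt(value.substring(dash + 1).trim());
        return new int[] { min, max };
    }

    /** Direct container: a single integer for consumersPerQueue; a range is assumed to be rejected. */
    static int forDirectContainer(String concurrency) {
        String value = concurrency.trim();
        if (value.indexOf('-') >= 0) {
            throw new IllegalArgumentException(
                    "consumersPerQueue needs a single integer, got: " + value);
        }
        return Integer.parseInt(value);
    }
}
```

In practice this means a listener declared with a range such as "2-8" should be changed to a single number before switching to the direct container, and that number applies per queue rather than per listener.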
posted @ 2022-02-10 17:38  木木米