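[Azure Redis] Simulating and Resolving the 15-Minute Client Timeout Caused by a Redis Server-Side Failover
Problem Description
While using Azure Cache for Redis, server-side maintenance triggered a failover. The client application runs on Linux and uses the Lettuce SDK, which has a known issue: after such a failover, the client can keep timing out for about 15 minutes.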
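Solution
How can a failover be reproduced on demand?
In the Azure portal, open the Azure Cache for Redis instance and go to Reboot. Note that only rebooting the Primary node simulates a failover scenario.
Which measures can resolve the problem?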
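1. Modify the Linux TCP settings to shorten the TCP retransmission time and mitigate the issue: net.ipv4.tcp_retries2 = 5
2. For Spring Boot integrated with Lettuce, override the ClientResources class to mitigate the issue.
3. The Lettuce open-source community also describes mitigating the issue by setting TCP_USER_TIMEOUT. When using the Lettuce SDK directly, the following can serve as a reference: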
- Use Lettuce >= 6.3.0
<dependencies>
    <dependency>
        <groupId>io.lettuce</groupId>
        <artifactId>lettuce-core</artifactId>
        <version>6.3.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-transport-native-epoll</artifactId>
        <version>4.1.100.Final</version>
        <classifier>linux-x86_64</classifier>
    </dependency>
</dependencies>
- Configure TCP_USER_TIMEOUT
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.SocketOptions.KeepAliveOptions;
import io.lettuce.core.SocketOptions.TcpUserTimeoutOptions;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

import java.time.Duration;

public class LettuceExample {
    /**
     * Enable TCP keepalive and configure the following three parameters:
     * TCP_KEEPIDLE = 30
     * TCP_KEEPINTVL = 10
     * TCP_KEEPCNT = 3
     */
    private static final int TCP_KEEPALIVE_IDLE = 30;

    /**
     * The TCP_USER_TIMEOUT parameter can avoid situations where Lettuce remains stuck
     * in a continuous timeout loop during a failure or crash event.
     * refer: https://github.com/lettuce-io/lettuce-core/issues/2082
     */
    private static final int TCP_USER_TIMEOUT = 30;

    private static RedisClient client = null;
    private static StatefulRedisConnection<String, String> connection = null;

    public static void main(String[] args) {
        // Replace the values of host, user, password, and port with the actual instance information.
        String host = "r-bp1s1bt2tlq3p1****.redis.rds.aliyuncs.com";
        String user = "r-bp1s1bt2tlq3p1****";
        String password = "Da****3";
        int port = 6379;

        // Config RedisURL
        RedisURI uri = RedisURI.Builder
                .redis(host, port)
                .withAuthentication(user, password)
                .build();

        // Config TCP KeepAlive
        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(TCP_KEEPALIVE_IDLE))
                        .interval(Duration.ofSeconds(TCP_KEEPALIVE_IDLE / 3))
                        .count(3)
                        .build())
                .tcpUserTimeout(TcpUserTimeoutOptions.builder()
                        .enable()
                        .tcpUserTimeout(Duration.ofSeconds(TCP_USER_TIMEOUT))
                        .build())
                .build();

        client = RedisClient.create(uri);
        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
        connection = client.connect();
        RedisCommands<String, String> commands = connection.sync();

        System.out.println(commands.set("foo", "bar"));
        System.out.println(commands.get("foo"));

        // If your application exits and you want to destroy the resources, call this method.
        // Then, the connection is closed, and the resources are released.
        connection.close();
        client.shutdown();
    }
}
4. Use a different SDK, such as Jedis, to avoid this class of problem.
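To see why measure 1 (net.ipv4.tcp_retries2 = 5) helps, it can be useful to estimate how long Linux keeps retransmitting on a dead connection. The sketch below is only a rough model, not the kernel's actual algorithm: it assumes the retransmission timeout starts at 200 ms, doubles on every retry, and is capped at 120 seconds, which is the hypothetical calculation used in the Linux kernel documentation for tcp_retries2. The class name RetransmissionTimeout and its constants are illustrative names, not part of any library.

```java
// Rough estimate of how long Linux retransmits before dropping a connection.
// Assumption: RTO starts at 200 ms, doubles per retry, capped at 120 s.
// Real values depend on the measured round-trip time of the connection.
class RetransmissionTimeout {
    static final long RTO_MIN_MS = 200;      // assumed initial retransmission timeout
    static final long RTO_MAX_MS = 120_000;  // assumed upper bound per retry

    // Sums the wait before each of the retries plus the final wait that gives up.
    static long totalMillis(int retries) {
        long total = 0;
        long rto = RTO_MIN_MS;
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, RTO_MAX_MS);
        }
        return total;
    }

    public static void main(String[] args) {
        // Compare the Linux default (15) with the value suggested above (5).
        for (int retries : new int[] {15, 5}) {
            System.out.println("tcp_retries2=" + retries + " -> "
                    + totalMillis(retries) + " ms");
        }
    }
}
```

Under these assumptions, the default of 15 retries adds up to 924600 ms, a little over 15 minutes, which matches the duration of the timeout described in this article, while 5 retries adds up to 12600 ms, about 13 seconds.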
References
TCP settings for Linux-hosted client applications: https://docs.azure.cn/zh-cn/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications
Add support for disconnect on timeout to recover early from no RST packet failures: https://github.com/redis/lettuce/issues/2082#issuecomment-2290496556
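When facing problems in a complex environment, the way to investigate is this: let muddy water settle and it slowly becomes clear; let stillness stir and life slowly arises. In the cloud, it is exactly so!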