Analyzing and Testing dubbo and ZooKeeper Heartbeat Parameters

  A note up front: the zk client and server are both sensitive to load. For applications such as big-data processing, setting and monitoring the zk heartbeat is critical; otherwise the system easily becomes unstable. For non-OLTP workloads that may run under high load for long stretches and trigger long GC pauses, prefer MQ or HTTP over zk or rpc.

Heartbeat mechanism between dubbo consumer and provider

  A heartbeat runs between the dubbo client and the dubbo server to keep the long-lived connection between provider and consumer alive. It is initiated by the dubbo client (see HeartbeatTask in the dubbo source). The heartbeat interval (heartbeat) defaults to 60s: once no message has been received for a full heartbeat interval, a heartbeat request is sent (the behavior is the same on provider and consumer). If three intervals in a row pass with no heartbeat response (heartbeatTimeout defaults to heartbeat * 3), the provider closes the channel and the consumer reconnects. Batch-style workloads hit this easily, for example when a GC pause exceeds one minute, or when concurrency is so high that the CPU sits at 100% and the heartbeat thread never gets scheduled; in those cases the timeout multiple or the heartbeat interval must be increased. (Our company runs a modified build whose default timeout is 15 seconds, so I changed the source to read these values from application.properties.) Fundamentally, long-lived rpc connections are a poor fit for services that take a long time to finish; that is a legacy constraint, and polling or messaging should be used instead. On both the provider and the consumer, heartbeat detection is implemented by starting a scheduled timer task.
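  To make the timing concrete, here is a minimal, self-contained sketch (not dubbo source; the 200-second pause is made up for illustration) of how the default heartbeat and heartbeat.timeout interact with a long pause:

public class HeartbeatMath {
    public static void main(String[] args) {
        int heartbeat = 60_000;               // default heartbeat interval, ms
        int heartbeatTimeout = heartbeat * 3; // default timeout = 3 intervals = 180s

        // pretend the last successful read was 200s ago, e.g. because a full GC
        // or a saturated CPU kept the heartbeat thread from running
        long now = System.currentTimeMillis();
        long lastRead = now - 200_000;

        // past one interval: a heartbeat request would be sent
        System.out.println("send heartbeat?  " + (now - lastRead > heartbeat));
        // past the timeout: the provider closes the channel, the consumer reconnects
        System.out.println("close/reconnect? " + (now - lastRead > heartbeatTimeout));
    }
}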

  • Entry points for the provider's bind and the consumer's connect:
public class HeaderExchanger implements Exchanger {

    public static final String NAME = "header";

    @Override
    public ExchangeClient connect(URL url, ExchangeHandler handler) throws RemotingException {
        return new HeaderExchangeClient(Transporters.connect(url, new DecodeHandler(new HeaderExchangeHandler(handler))), true);
    }

    @Override
    public ExchangeServer bind(URL url, ExchangeHandler handler) throws RemotingException {
        return new HeaderExchangeServer(Transporters.bind(url, new DecodeHandler(new HeaderExchangeHandler(handler))));
    }

}
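  For context, these bind/connect methods are normally reached through the Exchangers facade, and HeaderExchanger ("header") is the default exchanger. The sketch below is illustrative only; the URL and the echo-style handler are made up:

import com.alibaba.dubbo.common.URL;
import com.alibaba.dubbo.remoting.RemotingException;
import com.alibaba.dubbo.remoting.exchange.ExchangeChannel;
import com.alibaba.dubbo.remoting.exchange.ExchangeServer;
import com.alibaba.dubbo.remoting.exchange.Exchangers;
import com.alibaba.dubbo.remoting.exchange.support.ExchangeHandlerAdapter;

public class ExchangerEntryDemo {
    public static void main(String[] args) throws RemotingException {
        // hypothetical URL; with no "exchanger" parameter the default is "header",
        // so HeaderExchanger.bind() above is what ends up running
        URL url = URL.valueOf("dubbo://127.0.0.1:20880?heartbeat=60000");
        ExchangeServer server = Exchangers.bind(url, new ExchangeHandlerAdapter() {
            @Override
            public Object reply(ExchangeChannel channel, Object request) {
                return "pong"; // echo handler, for illustration only
            }
        });
        server.close();
    }
}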

  

  • The provider starts heartbeat detection
public HeaderExchangeServer(Server server) {
        if (server == null) {
            throw new IllegalArgumentException("server == null");
        }
        this.server = server;
        this.heartbeat = server.getUrl().getParameter(Constants.HEARTBEAT_KEY, 0);
        // the heartbeat timeout defaults to 3x the heartbeat interval
        this.heartbeatTimeout = server.getUrl().getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartbeat * 3);
        // throw if the heartbeat timeout is less than twice the heartbeat interval
        if (heartbeatTimeout < heartbeat * 2) {
            throw new IllegalStateException("heartbeatTimeout < heartbeatInterval * 2");
        }
        startHeatbeatTimer();
    }
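  Both values come off the provider URL. The following sketch reads them in isolation; the URL string is hypothetical, and it assumes the keys behind Constants.HEARTBEAT_KEY and Constants.HEARTBEAT_TIMEOUT_KEY are "heartbeat" and "heartbeat.timeout":

import com.alibaba.dubbo.common.URL;

public class HeartbeatUrlDemo {
    public static void main(String[] args) {
        // hypothetical provider URL with both heartbeat parameters set explicitly
        URL url = URL.valueOf(
                "dubbo://127.0.0.1:20880?heartbeat=60000&heartbeat.timeout=180000");
        int heartbeat = url.getParameter("heartbeat", 0);
        int heartbeatTimeout = url.getParameter("heartbeat.timeout", heartbeat * 3);
        // same sanity check as the constructor above
        if (heartbeatTimeout < heartbeat * 2) {
            throw new IllegalStateException("heartbeatTimeout < heartbeatInterval * 2");
        }
        System.out.println(heartbeat + " / " + heartbeatTimeout);
    }
}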

  

  • Implementation of startHeatbeatTimer
    • First stop any existing timer task, then start a new one
private void startHeatbeatTimer() {
        // stop the existing timer task
        stopHeartbeatTimer();
        // start a new timer task
        if (heartbeat > 0) {
            heatbeatTimer = scheduled.scheduleWithFixedDelay(
                    new HeartBeatTask(new HeartBeatTask.ChannelProvider() {
                        public Collection<Channel> getChannels() {
                            return Collections.unmodifiableCollection(HeaderExchangeServer.this.getChannels());
                        }
                    }, heartbeat, heartbeatTimeout),
                    heartbeat, heartbeat, TimeUnit.MILLISECONDS);
        }
    }
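  Note that scheduleWithFixedDelay measures the delay from the end of one run to the start of the next, so if the heartbeat thread itself is starved (long GC, 100% CPU), detection is delayed along with everything else. A standalone sketch of the same scheduling pattern (illustrative names, not dubbo code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class FixedDelayHeartbeatDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduled = Executors.newSingleThreadScheduledExecutor();
        long heartbeat = 1_000; // 1s here so the demo finishes quickly; dubbo defaults to 60s

        // each tick would normally walk all channels, exactly like HeartBeatTask.run()
        ScheduledFuture<?> heartbeatTimer = scheduled.scheduleWithFixedDelay(
                () -> System.out.println("heartbeat tick at " + System.currentTimeMillis()),
                heartbeat, heartbeat, TimeUnit.MILLISECONDS);

        Thread.sleep(3_500);          // let a few ticks run
        heartbeatTimer.cancel(true);  // stopHeartbeatTimer() does the equivalent
        scheduled.shutdown();
    }
}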

 

  • Implementation of HeartBeatTask
    • It iterates over all channels and checks idle time: if there has been neither a read nor a write within one heartbeat interval, it sends a heartbeat request that expects a response; finally it checks whether the read idle time exceeds heartbeatTimeout, in which case the provider closes the channel while the consumer reconnects.
public void run() {
        try {
            long now = System.currentTimeMillis();
            for (Channel channel : channelProvider.getChannels()) {
                if (channel.isClosed()) {
                    continue;
                }
                try {
                    Long lastRead = (Long) channel.getAttribute(HeaderExchangeHandler.KEY_READ_TIMESTAMP);
                    Long lastWrite = (Long) channel.getAttribute(HeaderExchangeHandler.KEY_WRITE_TIMESTAMP);
                    // if either the last read or the last write is older than one heartbeat interval, send a heartbeat
                    if ((lastRead != null && now - lastRead > heartbeat)
                            || (lastWrite != null && now - lastWrite > heartbeat)) {
                        Request req = new Request();
                        req.setVersion("2.0.0");
                        req.setTwoWay(true); // heartbeat event that requires a response
                        req.setEvent(Request.HEARTBEAT_EVENT);
                        channel.send(req);
                        if (logger.isDebugEnabled()) {
                            logger.debug("Send heartbeat to remote channel " + channel.getRemoteAddress()
                                    + ", cause: The channel has no data-transmission exceeds a heartbeat period: " + heartbeat + "ms");
                        }
                    }
                    // the last read is older than the heartbeat timeout
                    if (lastRead != null && now - lastRead > heartbeatTimeout) {
                        logger.warn("Close channel " + channel
                                + ", because heartbeat read idle time out: " + heartbeatTimeout + "ms");
                        // client side: reconnect to the server
                        if (channel instanceof Client) {
                            try {
                                ((Client) channel).reconnect();
                            } catch (Exception e) {
                                //do nothing
                            }
                        // server side: close the connection to the client
                        } else {
                            channel.close();
                        }
                    }
                } catch (Throwable t) {
                    logger.warn("Exception when heartbeat to remote channel " + channel.getRemoteAddress(), t);
                }
            }
        } catch (Throwable t) {
            logger.warn("Unhandled exception when heartbeat, cause: " + t.getMessage(), t);
        }
    }

 

  • Implementation on the consumer side
    • Heartbeat detection is enabled by default
public HeaderExchangeClient(Client client, boolean needHeartbeat) {
        if (client == null) {
            throw new IllegalArgumentException("client == null");
        }
        this.client = client;
        // create the HeaderExchangeChannel wrapping the client
        this.channel = new HeaderExchangeChannel(client);
        // read the heartbeat-related configuration
        String dubbo = client.getUrl().getParameter(Constants.DUBBO_VERSION_KEY);
        this.heartbeat = client.getUrl().getParameter(Constants.HEARTBEAT_KEY, dubbo != null && dubbo.startsWith("1.0.") ? Constants.DEFAULT_HEARTBEAT : 0);
        this.heartbeatTimeout = client.getUrl().getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartbeat * 3);
        if (heartbeatTimeout < heartbeat * 2) { // guard against a timeout shorter than two intervals
            throw new IllegalStateException("heartbeatTimeout < heartbeatInterval * 2");
        }
        // start the heartbeat timer if requested
        if (needHeartbeat) {
            startHeatbeatTimer();
        }
    }

Heartbeat between the dubbo client/server and the registry (zk)

  This heartbeat is initiated by the dubbo client or server and is simply the heartbeat mechanism between the zk ensemble and its clients. The interval is governed by the zk server parameter tickTime (the interval used to maintain heartbeats between ZooKeeper servers or between clients and servers; a heartbeat is sent every tickTime, and the minimum session expiration time is 2 * tickTime). In practice, however, we observed the heartbeat interval to be tickTime / 2 (in the case below the server was so busy that the zk client could not send its heartbeats to the server in time), as shown in the following log:

[] 2019-08-07 14:51:46 [5189311] [ERROR] Curator-Framework-0 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:200) Connection timed out for connection string (localhost:2181) and timeout (5000) / elapsed (27053)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.10.0.jar!/:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_211]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_211]
[] 2019-08-07 14:51:46 [5190082] [WARN] Curator-Framework-0-SendThread(0:0:0:0:0:0:0:1:2181) org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1102) Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_211]
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:1.8.0_211]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[zookeeper-3.4.6.jar!/:3.4.6-1569965]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [zookeeper-3.4.6.jar!/:3.4.6-1569965]
[] 2019-08-07 14:51:47 [5190312] [ERROR] Curator-Framework-0 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:200) Connection timed out for connection string (localhost:2181) and timeout (5000) / elapsed (28054)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.10.0.jar!/:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_211]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_211]
[] 2019-08-07 14:51:47 [5191182] [INFO] Curator-Framework-0-SendThread(k3ctest.yidooo.com:2181) org.apache.zookeeper.ClientCnxn$SendThread.logStartConnect(ClientCnxn.java:975) Opening socket connection to server k3ctest.yidooo.com/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
[] 2019-08-07 14:51:48 [5191314] [ERROR] Curator-Framework-0 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:200) Connection timed out for connection string (localhost:2181) and timeout (5000) / elapsed (29056)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116) [curator-client-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.10.0.jar!/:?]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.10.0.jar!/:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_211]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_211]

At startup you can see both the session timeout configured by the client and the value negotiated with the server, for example:

21:28:57.442 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=localhost:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@5afd2f4e
21:28:57.475 [main-SendThread(localhost:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
21:28:57.477 [main-SendThread(localhost:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
21:28:57.508 [main-SendThread(localhost:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x10025cfdf590000, negotiated timeout = 60000
21:28:57.515 [main-EventThread] INFO  org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
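The sessionTimeout=60000 above and the connection timeout (5000) in the earlier error log come from how the Curator client is created. A minimal sketch (the connect string and values are hypothetical, not taken from the original setup):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorTimeoutDemo {
    public static void main(String[] args) {
        // 60s requested session timeout, 5s connection timeout; the server may still
        // clamp the session timeout into [minSessionTimeout, maxSessionTimeout]
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181",
                60_000,                              // sessionTimeoutMs
                5_000,                               // connectionTimeoutMs
                new ExponentialBackoffRetry(1_000, 3));
        client.start();
        // ... use the client, then close it
        client.close();
    }
}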

 

Synchronization time limits between zk ensemble nodes

(1) initLimit

  This setting limits how long a follower (a "client" relative to the Leader) may take to connect to and complete its initial sync with the Leader, measured in ticks of tickTime (the fundamental parameter that serves as the base unit for all time-related settings). If the initial connection takes longer than this, it is considered failed.

(2) syncLimit

  This setting bounds the request/response round trip between Leader and Follower when they exchange messages, again in ticks of tickTime. If a follower cannot communicate with the leader within this window, that follower is dropped.
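  Putting the server-side knobs together, a zoo.cfg might look like the following (illustrative values, not taken from the original environment):

# zoo.cfg -- the limits are expressed in ticks of tickTime
tickTime=2000        # base time unit in ms; also the heartbeat interval
initLimit=10         # follower initial connect/sync allowance: 10 * 2000 ms = 20 s
syncLimit=5          # leader<->follower request/response allowance: 5 * 2000 ms = 10 s
# optional overrides for the session-timeout clamp discussed in the next section
# minSessionTimeout=4000
# maxSessionTimeout=40000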

Client session timeout

  When the client passes its sessionTimeout to zk, the server readjusts the final value against minSessionTimeout (default 2 * tickTime) and maxSessionTimeout (default 20 * tickTime). With the default tickTime of 2000 ms the ceiling is therefore 40 seconds, so a client asking for 60,000 ms ends up with 40,000 ms and the session expires after 40 seconds by default. For long pauses such as GC-induced stop-the-world phases, keeping a single GC STW pause at or below tickTime greatly reduces zk reconnects; raising maxSessionTimeout and minSessionTimeout is also an option.

// server-side defaults: min = 2 * tickTime, max = 20 * tickTime
public int getMinSessionTimeout() {
    return minSessionTimeout == -1 ? tickTime * 2 : minSessionTimeout;
}

public int getMaxSessionTimeout() {
    return maxSessionTimeout == -1 ? tickTime * 20 : maxSessionTimeout;
}

// server-side negotiation: clamp the client-requested sessionTimeout into [min, max]
int minSessionTimeout = zk.getMinSessionTimeout();
if (sessionTimeout < minSessionTimeout) {
    sessionTimeout = minSessionTimeout;
}
int maxSessionTimeout = zk.getMaxSessionTimeout();
if (sessionTimeout > maxSessionTimeout) {
    sessionTimeout = maxSessionTimeout;
}
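  As a quick check of the arithmetic, a tiny standalone sketch of the same clamping (assuming the default tickTime of 2000 ms and a client requesting 60 s):

public class SessionTimeoutClampDemo {
    public static void main(String[] args) {
        int tickTime = 2_000;                  // default zoo.cfg tickTime, ms
        int minSessionTimeout = tickTime * 2;  // 4_000 ms
        int maxSessionTimeout = tickTime * 20; // 40_000 ms

        int requested = 60_000;                // what the client asks for
        int negotiated = Math.max(minSessionTimeout, Math.min(requested, maxSessionTimeout));
        System.out.println("negotiated timeout = " + negotiated); // prints 40000
    }
}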

The official ZooKeeper documentation describes these two parameters as follows:

minSessionTimeout

(No Java system property)

New in 3.3.0: the minimum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 2 times the tickTime.

maxSessionTimeout

(No Java system property)

New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.

  Finally, keep in mind that the Leader node is a single point that coordinates all transactions. If the leader goes down, you should understand how a new one is elected; see: zookeeper核心原理详解 (a separate article on ZooKeeper core principles).
