Configuring Multiple HMasters in HBase

To keep an HBase cluster highly available, HBase supports configuring multiple backup masters. When the active master goes down, a backup master automatically takes over the whole cluster.

The configuration is very simple:

In the $HBASE_HOME/conf/ directory, create a file named backup-masters and list in it the hostname of each node that should act as a backup master, one per line. For example:

[hbase@master conf]$ cat backup-masters 
node1
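As a rough sketch, the file can also be created from a script. The helper name add_backup_master is hypothetical (it is not part of HBase), and it assumes the conf directory already exists and is writable by the current user:

```shell
# add_backup_master CONF_DIR HOSTNAME
# Hypothetical helper: appends HOSTNAME to CONF_DIR/backup-masters
# unless that hostname is already listed, so it is safe to run twice.
add_backup_master() {
    conf_dir=$1
    host=$2
    file="$conf_dir/backup-masters"
    touch "$file" || return 1
    # -x matches the whole line, -F treats the hostname as a fixed string
    if ! grep -qxF -- "$host" "$file"; then
        printf '%s\n' "$host" >> "$file"
    fi
}
```

For the setup in this post the call would be add_backup_master "$HBASE_HOME/conf" node1.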

 

After that, start the whole cluster. An HMaster process is now running on both master and node1:

[hbase@master conf]$ jps
25188 NameNode
3319 QuorumPeerMain
31725 Jps
25595 ResourceManager
31077 HMaster
25711 NodeManager
25303 DataNode
31617 Main
31220 HRegionServer

 

[hbase@node1 root]$ jps
11560 DataNode
11762 NodeManager
20769 Jps
415 QuorumPeerMain
11675 SecondaryNameNode
20394 HRegionServer
20507 HMaster
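The same check can be scripted instead of read by eye. The filter below is only a sketch; the name hmaster_pid is made up for this example, and it assumes jps prints one "PID ClassName" pair per line, as in the output above:

```shell
# hmaster_pid: read jps output on stdin and print the PID of every
# process whose name is exactly HMaster (prints nothing if none runs).
hmaster_pid() {
    awk '$2 == "HMaster" { print $1 }'
}
```

Typical use would be jps | hmaster_pid on each node. An empty result means no HMaster is running there.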

Now look at the master log on node1. It contains the following:

[hbase@node1 logs]$ tail -f hbase-hbase-master-node1.log
2015-10-10 05:35:09,609 INFO  [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:60010
2015-10-10 05:35:09,613 INFO  [main] master.HMaster: hbase.rootdir=hdfs://master:9000/hbase, hbase.cluster.distributed=true
2015-10-10 05:35:09,631 INFO  [main] master.HMaster: Adding backup master ZNode /hbase/backup-masters/node1,60000,1444455307700
2015-10-10 05:35:09,806 INFO  [node1:60000.activeMasterManager] master.ActiveMasterManager: Another master is the active master, master,60000,1444455305852; waiting to become the next active master
2015-10-10 05:35:09,858 INFO  [master/node1/10.0.52.145:60000] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x10135dbc connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:35:09,858 INFO  [master/node1/10.0.52.145:60000] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x10135dbc0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:35:09,859 INFO  [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Opening socket connection to server node2/10.0.52.146:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:35:09,860 INFO  [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Socket connection established to node2/10.0.52.146:2181, initiating session
2015-10-10 05:35:09,885 INFO  [master/node1/10.0.52.145:60000-SendThread(node2:2181)] zookeeper.ClientCnxn: Session establishment complete on server node2/10.0.52.146:2181, sessionid = 0x350463058c10017, negotiated timeout = 40000
2015-10-10 05:35:09,920 INFO  [master/node1/10.0.52.145:60000] regionserver.HRegionServer: ClusterId : c309a039-eb35-400c-bb13-0b6ed939cc5e

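These messages show that the cluster already has an active master, running on the host named master, so node1 starts waiting until the HMaster on master goes down. At that point node1 becomes the new active master.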
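The znode name in the log above, node1,60000,1444455307700, follows HBase's server-name form of hostname, port, and start code. The start code is the process start time in epoch milliseconds; 1444455307700 corresponds to 2015-10-10 05:35:07 UTC, which matches the log timestamps if they are in UTC. A small sketch for splitting such a name (parse_server_name is a hypothetical helper, not an HBase command):

```shell
# parse_server_name NAME
# Splits an HBase server name such as "node1,60000,1444455307700"
# into its three comma-separated parts.
parse_server_name() {
    printf '%s\n' "$1" |
        awk -F, '{ printf "host=%s port=%s startcode=%s\n", $1, $2, $3 }'
}
```

Comparing start codes is a quick way to tell a restarted master from the original process, since a restart produces a new start code.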

Now kill the HMaster process on the master node directly and look at the master log on node1 again:

2015-10-10 05:42:17,173 INFO  [node1:60000.activeMasterManager] master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/node1,60000,1444455307700 from backup master directory
2015-10-10 05:42:17,194 INFO  [node1:60000.activeMasterManager] master.ActiveMasterManager: Registered Active Master=node1,60000,1444455307700
2015-10-10 05:42:17,758 INFO  [node1:60000.activeMasterManager] fs.HFileSystem: Added intercepting call to namenode#getBlockLocations so can do block reordering using class class org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks
2015-10-10 05:42:17,776 INFO  [node1:60000.activeMasterManager] coordination.SplitLogManagerCoordination: Found 0 orphan tasks and 0 rescan nodes
2015-10-10 05:42:17,880 INFO  [node1:60000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x29d405f7 connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:42:17,880 INFO  [node1:60000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x29d405f70x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:42:17,883 INFO  [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Opening socket connection to server node2/10.0.52.146:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:42:17,884 INFO  [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Socket connection established to node2/10.0.52.146:2181, initiating session
2015-10-10 05:42:17,904 INFO  [node1:60000.activeMasterManager-SendThread(node2:2181)] zookeeper.ClientCnxn: Session establishment complete on server node2/10.0.52.146:2181, sessionid = 0x350463058c1001b, negotiated timeout = 40000
2015-10-10 05:42:17,942 INFO  [node1:60000.activeMasterManager] balancer.StochasticLoadBalancer: loading config
2015-10-10 05:42:18,061 INFO  [node1:60000.activeMasterManager] master.HMaster: Server active/primary master=node1,60000,1444455307700, sessionid=0x150463058ac001a, setting cluster-up flag (Was=true)
2015-10-10 05:42:18,154 INFO  [node1:60000.activeMasterManager] procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
2015-10-10 05:42:18,184 INFO  [node1:60000.activeMasterManager] procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/flush-table-proc/acquired /hbase/flush-table-proc/reached /hbase/flush-table-proc/abort
2015-10-10 05:42:18,256 INFO  [node1:60000.activeMasterManager] master.MasterCoprocessorHost: System coprocessor loading is enabled
2015-10-10 05:42:18,286 INFO  [node1:60000.activeMasterManager] procedure2.ProcedureExecutor: Starting procedure executor threads=5
2015-10-10 05:42:18,288 INFO  [node1:60000.activeMasterManager] wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
2015-10-10 05:42:18,296 INFO  [node1:60000.activeMasterManager] util.FSHDFSUtils: Recovering lease on dfs file hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log
2015-10-10 05:42:18,307 INFO  [node1:60000.activeMasterManager] util.FSHDFSUtils: recoverLease=true, attempt=0 on file=hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log after 9ms
2015-10-10 05:42:18,324 WARN  [node1:60000.activeMasterManager] wal.WALProcedureStore: Unable to read tracker for hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log - Missing trailer: size=9 startPos=9
2015-10-10 05:42:18,373 INFO  [node1:60000.activeMasterManager] wal.WALProcedureStore: Lease acquired for flushLogId: 28
2015-10-10 05:42:18,383 WARN  [node1:60000.activeMasterManager] wal.ProcedureWALFormatReader: nothing left to decode. exiting with missing EOF
2015-10-10 05:42:18,383 INFO  [node1:60000.activeMasterManager] wal.ProcedureWALFormatReader: No active entry found in state log hdfs://master:9000/hbase/MasterProcWALs/state-00000000000000000027.log. removing it
2015-10-10 05:42:18,405 INFO  [node1:60000.activeMasterManager] zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:42:18,405 INFO  [node1:60000.activeMasterManager] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=replicationLogCleaner0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:42:18,407 INFO  [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Opening socket connection to server node1/10.0.52.145:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:42:18,408 INFO  [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Socket connection established to node1/10.0.52.145:2181, initiating session
2015-10-10 05:42:18,426 INFO  [node1:60000.activeMasterManager-SendThread(node1:2181)] zookeeper.ClientCnxn: Session establishment complete on server node1/10.0.52.145:2181, sessionid = 0x250463058780018, negotiated timeout = 40000
2015-10-10 05:42:18,464 INFO  [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:19,970 INFO  [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:21,475 INFO  [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3011 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:22,980 INFO  [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 4516 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:23,058 INFO  [PriorityRpcServer.handler=3,queue=1,port=60000] master.ServerManager: Registering server=node1,16020,1444455306545
2015-10-10 05:42:23,059 INFO  [PriorityRpcServer.handler=5,queue=1,port=60000] master.ServerManager: Registering server=master,16020,1444455306763
2015-10-10 05:42:23,060 INFO  [PriorityRpcServer.handler=1,queue=1,port=60000] master.ServerManager: Registering server=node2,16020,1444455305886
2015-10-10 05:42:23,081 INFO  [node1:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 4617 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2015-10-10 05:42:24,586 INFO  [node1:60000.activeMasterManager] master.ServerManager: Finished waiting for region servers count to settle; checked in 3, slept for 6122 ms, expecting minimum of 1, maximum of 2147483647, master is running
2015-10-10 05:42:24,610 INFO  [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/master,16020,1444455306763 belongs to an existing region server
2015-10-10 05:42:24,619 INFO  [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/node1,16020,1444455306545 belongs to an existing region server
2015-10-10 05:42:24,625 INFO  [node1:60000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://master:9000/hbase/WALs/node2,16020,1444455305886 belongs to an existing region server
2015-10-10 05:42:24,757 INFO  [node1:60000.activeMasterManager] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1444455744651, server=null} to {1588230740 state=OPEN, ts=1444455744756, server=node2,16020,1444455305886}
2015-10-10 05:42:24,757 INFO  [node1:60000.activeMasterManager] master.ServerManager: AssignmentManager hasn't finished failover cleanup; waiting
2015-10-10 05:42:24,760 INFO  [node1:60000.activeMasterManager] master.HMaster: hbase:meta with replicaId 0 assigned=0, rit=false, location=node2,16020,1444455305886
2015-10-10 05:42:24,895 INFO  [node1:60000.activeMasterManager] hbase.MetaMigrationConvertingToPB: META already up-to date with PB serialization
2015-10-10 05:42:24,985 INFO  [node1:60000.activeMasterManager] master.AssignmentManager: Found regions out on cluster or in RIT; presuming failover
2015-10-10 05:42:25,000 INFO  [node1:60000.activeMasterManager] master.AssignmentManager: Joined the cluster in 104ms, failover=true
2015-10-10 05:42:25,216 INFO  [node1:60000.activeMasterManager] master.HMaster: Master has completed initialization
2015-10-10 05:42:25,234 INFO  [node1:60000.activeMasterManager] quotas.MasterQuotaManager: Quota support disabled

As the log shows, the backup master on node1 has taken over and become the active HMaster.
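To get a feel for how long the takeover took, the two log messages "Registered Active Master=" and "Master has completed initialization" can be compared. The function below is a sketch under a few assumptions: the name failover_seconds is made up, the second whitespace-separated field of each log line is the time in HH:MM:SS,mmm form as in the log above, and both messages fall on the same day.

```shell
# failover_seconds LOGFILE
# Prints the seconds between registering as active master and finishing
# initialization, once for each takeover found in the log.
failover_seconds() {
    LC_ALL=C awk '
        # Convert "HH:MM:SS,mmm" to seconds since midnight.
        function secs(t,  p) {
            split(t, p, /[:,]/)
            return p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
        }
        /Registered Active Master=/ { start = secs($2) }
        /Master has completed initialization/ {
            if (start != "") printf "%.3f\n", secs($2) - start
        }
    ' "$1"
}
```

This measures only the new master's own initialization. The time ZooKeeper needs to notice that the old master is gone comes before the first message and is not included.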

Restart the HMaster on the master node:

[hbase@master bin]$ ./hbase-daemon.sh start master 
starting master, logging to /usr/local/hbase//logs/hbase-hbase-master-master.out
[hbase@master bin]$ jps
25188 NameNode
32351 Jps
3319 QuorumPeerMain
32265 HMaster
25595 ResourceManager
25711 NodeManager
25303 DataNode
31220 HRegionServer

The log on the master node shows that it has now become a backup master:

[hbase@master logs]$ tail -f  hbase-hbase-master-master.log
2015-10-10 05:53:15,329 INFO  [main] mortbay.log: Started SelectChannelConnector@0.0.0.0:60010
2015-10-10 05:53:15,333 INFO  [main] master.HMaster: hbase.rootdir=hdfs://master:9000/hbase, hbase.cluster.distributed=true
2015-10-10 05:53:15,348 INFO  [main] master.HMaster: Adding backup master ZNode /hbase/backup-masters/master,60000,1444456393819
2015-10-10 05:53:15,488 INFO  [master:60000.activeMasterManager] master.ActiveMasterManager: Another master is the active master, node1,60000,1444455307700; waiting to become the next active master
2015-10-10 05:53:15,522 INFO  [master/master/10.0.52.144:60000] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x323b7deb connecting to ZooKeeper ensemble=master:2181,node1:2181,node2:2181
2015-10-10 05:53:15,522 INFO  [master/master/10.0.52.144:60000] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=90000 watcher=hconnection-0x323b7deb0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2015-10-10 05:53:15,524 INFO  [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Opening socket connection to server master/10.0.52.144:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-10 05:53:15,525 INFO  [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Socket connection established to master/10.0.52.144:2181, initiating session
2015-10-10 05:53:15,536 INFO  [master/master/10.0.52.144:60000-SendThread(master:2181)] zookeeper.ClientCnxn: Session establishment complete on server master/10.0.52.144:2181, sessionid = 0x150463058ac001c, negotiated timeout = 40000
2015-10-10 05:53:15,567 INFO  [master/master/10.0.52.144:60000] regionserver.HRegionServer: ClusterId : c309a039-eb35-400c-bb13-0b6ed939cc5e
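The two ActiveMasterManager messages seen in these logs are enough to tell which role a master currently holds. As a sketch (master_role_from_log is a hypothetical name, and the message texts are taken from the logs in this post, so other HBase versions may word them differently):

```shell
# master_role_from_log LOGFILE
# Prints "active", "backup", or "unknown" depending on the most recent
# role-related message in an HMaster log.
master_role_from_log() {
    awk '
        /Another master is the active master/ { role = "backup" }
        /Registered Active Master=/           { role = "active" }
        END { print (role == "" ? "unknown" : role) }
    ' "$1"
}
```

Because the last matching line wins, the result stays correct when one log file spans several restarts. For example, master_role_from_log hbase-hbase-master-master.log on the master node would be expected to print backup after the restart shown above.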

 

Posted on 2015-10-10 13:45 by 诗圆