NodeManager restarts frequently

【Symptom】

The field team reported that the NodeManager process on YARN slave nodes restarts automatically for no apparent reason.

【Troubleshooting】
1. Reviewing the NodeManager log shows no exception at all before the STARTUP_MSG banner, which suggests the process did not crash on its own but was killed externally.

2022-03-08 15:45:01 [ContainersLauncher #2100] INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(806) -Deleting absolute path : /data/sata/2/nm-local-dir/usercache/appcache/application_1646449533123_6428/container_e07_1646449533123_6428_01_000003/container_tokens
2022-03-08 15:45:01 [ContainersLauncher #2100] INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(806) -Deleting absolute path : /data/sata/2/nm-local-dir/usercache/appcache/application_1646449533123_6428/container_e07_1646449533123_6428_01_000003/sysfs
2022-03-08 15:45:03 [PublicLocalizer #21] INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(239) -SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-03-08 15:45:03 [PublicLocalizer #21] INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(239) -SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-03-08 15:45:03 [PublicLocalizer #21] INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(239) -SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-03-08 15:45:03 [Node Status Updater] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(721) -Removed completed containers from NM context: [container_e07_1646449533123_6428_01_000003]
2022-03-08 15:45:06 [PublicLocalizer #21] INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(239) -SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-03-08 15:45:46 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager.info(51) -STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: host = hh-100.node.cluster.sxst/151.100.300.103
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.2.1
STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on 2021-07-29T02:59Z
STARTUP_MSG: java = 1.8.0_291
************************************************************/
2022-03-08 15:45:46 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager.info(51) -registered UNIX signal handlers for [TERM, HUP, INT]
2022-03-08 15:45:46 [main] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.ResourcePluginManager.getPluginsFromConfig(72) -No Resource plugins found from configuration!
2022-03-08 15:45:46 [main] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.ResourcePluginManager.getPluginsFromConfig(74) -Found Resource plugins from configuration: null
2022-03-08 15:45:47 [main] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initNetworkResourceHandler(200) -Using traffic control bandwidth handler
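The STARTUP_MSG banner is the clearest marker of a restart: counting these banners and extracting their timestamps shows how often the process bounces. A minimal sketch, run here against a small sample of the log above (the real log path, e.g. under /var/log/hadoop-yarn/, varies by installation and is an assumption):

```shell
# Build a tiny sample of the NodeManager log for illustration;
# in practice point LOG at the real log file on the slave node.
LOG=nm-sample.log
cat > "$LOG" <<'EOF'
2022-03-08 15:45:03 [Node Status Updater] INFO NodeStatusUpdaterImpl -Removed completed containers
2022-03-08 15:45:46 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager.info(51) -STARTUP_MSG:
2022-03-09 02:47:10 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager.info(51) -STARTUP_MSG:
EOF

# Each trailing "-STARTUP_MSG:" INFO line marks one (re)start;
# print when each restart happened, then count them.
grep 'STARTUP_MSG:$' "$LOG" | awk '{print $1, $2}'
grep -c 'STARTUP_MSG:$' "$LOG"
```

Against the sample this prints the two restart timestamps followed by the count 2; the second timestamp here is invented sample data, not taken from the incident.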


2. Check the system log /var/log/messages. A monitoring agent probes for the process with `ps -ef | grep NodeManager | grep -v grep`; exit status 1 means the grep matched nothing, i.e. the NodeManager process was already gone, so the agent flagged it as unhealthy.

Mar 9 02:46:36 master-0: [ERROR] 2022/03/09 02:46:36 log.go:58: [process NodeManager] exec cmd [ps -ef |grep NodeManager |grep -v grep] result [exit status 1] err [].
Mar 9 02:46:36 master-0: [WARN] 2022/03/09 02:46:36 log.go:58: [process NodeManager] is unhealth.
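The [ERROR]/[WARN] pair comes from a process watchdog. The exact monitoring agent is unknown; a hedged sketch that only reproduces the probe command seen in the log:

```shell
#!/bin/sh
# Sketch of the watchdog's liveness probe (the real monitoring agent is
# unknown; this reproduces only the command from /var/log/messages).
# `grep -v grep` drops the grep process itself from the ps listing;
# the pipeline exits 1 when no NodeManager process line remains.
if ps -ef | grep NodeManager | grep -v grep > /dev/null; then
    echo "process NodeManager is healthy"
else
    echo "process NodeManager is unhealth"   # wording as in the agent's log
fi
```

The key semantics: `grep` exits 1 on zero matches, so an exit status of 1 from this pipeline is the signal that the process has disappeared.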

3. Check the kernel fault log /var/log/dmesg.log; it shows the process being killed by the OOM killer:

[1488922.976511] Memory cgroup out of memory: Kill process 27410 (java) score 1953 or sacrifice child
[1488922.976602] Killed process 27084 (java) total-vm:59366936kB, anon-rss:1993252kB, file-rss:15528kB, shmem-rss:0kB
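The second line identifies the victim and its memory footprint: anon-rss:1993252kB is roughly 1.9 GiB of resident JVM memory at the moment of the kill. A small sketch that pulls the victim PID and RSS out of such dmesg lines, run here against a sample of the two lines above:

```shell
# Save the two dmesg lines above as a sample and extract the victim
# PID and its anonymous RSS (~1.9 GiB in this incident).
cat > oom-sample.log <<'EOF'
[1488922.976511] Memory cgroup out of memory: Kill process 27410 (java) score 1953 or sacrifice child
[1488922.976602] Killed process 27084 (java) total-vm:59366936kB, anon-rss:1993252kB, file-rss:15528kB, shmem-rss:0kB
EOF
awk '/Killed process/ {gsub(",", "", $7); print "pid=" $4, $7}' oom-sample.log
# On a live node the same pattern applies to the kernel ring buffer:
#   dmesg | grep -E 'Memory cgroup out of memory|Killed process'
```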

【Conclusion】
An OOM (out-of-memory) kill occurred.
Note:
Uneven resource allocation caused the pod's real memory demand to exceed what had been allocated for it on the physical host. When a memory cgroup runs out of memory, the kernel's OOM killer scores the processes in that cgroup (an oom_score per process, biased by the oom_score_adj values Kubernetes sets per pod QoS class) and kills the process with the highest score, here the NodeManager JVM, which is what produced the restarts observed above.
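The kernel's choice of victim is driven by each process's oom_score (higher means more likely to be killed), which can be read straight from /proc. A sketch using the current shell for illustration (on a real node, substituting the NodeManager JVM's PID, e.g. from `pgrep -f NodeManager`, is the intended use; Linux only):

```shell
#!/bin/sh
# Read the OOM-killer score of a process from /proc (Linux only).
# $$ (the current shell) stands in for the NodeManager PID here.
pid=$$
printf 'pid=%s oom_score=%s oom_score_adj=%s\n' "$pid" \
    "$(cat /proc/"$pid"/oom_score)" \
    "$(cat /proc/"$pid"/oom_score_adj)"
```

A strongly negative oom_score_adj (as Kubernetes sets for Guaranteed pods) protects a process; positive adjustments make it a preferred victim.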

posted @ 2022-03-09 16:13  小小程序员_sjk