
oom-killer, dmesg, /var/log/messages

Posted on 2014-11-19 22:16 by bw_0927 · Views: 2391 · Comments: 0

http://blog.chinaunix.net/uid-29242873-id-3942763.html

http://hongjiang.info/tag/oom-killer/

 

#dmesg -r  

print the raw message buffer, i.e., don’t strip the log level prefixes.

 

The Linux kernel has a mechanism called the OOM killer (Out-Of-Memory killer). It watches for processes that use too much memory, especially ones that consume a large amount very quickly, and kills them to keep the system from running out of memory altogether. A typical symptom: one day a machine suddenly refuses ssh logins but still answers ping, so it is not a network problem; the real cause is that the sshd process was killed by the OOM killer (I have run into this kind of apparent hang several times). After rebooting and checking the system log /var/log/messages, you will find an error like: Out of Memory: Kill process 1865 (sshd).
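
A quick way to confirm that the OOM killer was responsible is to search the kernel ring buffer and the syslog for its messages (the exact wording varies between kernel versions):

dmesg | grep -i -E 'out of memory|killed process'
grep -i 'out of memory' /var/log/messages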

 

To keep important system processes from being killed when the OOM mechanism fires, you can set /proc/PID/oom_adj to -17, which exempts that particular process from the kernel's OOM killer. The kernel runs a heuristic that gives every process a score and kills the one that scores highest; each process's current score can be read from /proc/PID/oom_score. In day-to-day operations we usually protect sshd and a few management agents this way.
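
For example, to see how a given process currently ranks (sshd here, purely as an illustration; pidof -s returns a single PID):

PID=$(pidof -s sshd)       # pick one sshd PID as an example
cat /proc/$PID/oom_score   # current OOM badness score
cat /proc/$PID/oom_adj     # current adjustment (-16..15, or -17 for disabled)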

 

To protect a specific process from being killed by the kernel:


echo -17 > /proc/$PID/oom_adj

To keep sshd from being killed:


pgrep -f "/usr/sbin/sshd" | while read PID;do echo -17 > /proc/$PID/oom_adj;done

To be safer still, add a cron job that reapplies the setting every minute:


# /etc/cron.d/oom_disable
*/1 * * * * root pgrep -f "/usr/sbin/sshd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

To avoid losing the setting after a reboot, you can also put it in /etc/rc.d/rc.local:


# pidof may return several sshd PIDs (one per connection), so loop over them
for PID in $(pidof sshd); do echo -17 > /proc/$PID/oom_adj; done

Why -17 rather than some other value (the default is 0)? That is defined by the Linux kernel itself, as the source shows:
Taking the linux-3.3.6 kernel as an example, include/linux/oom.h defines the tunable range of oom_adj as -16 to 15 (15 largest, -16 smallest), with -17 meaning OOM-disable for the process. The oom_adj value scales the badness score by a factor of 2^n, where n is oom_adj, so the higher a process's oom_score, the sooner the kernel will kill it.
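
On newer kernels (2.6.36 and later) the preferred knob is /proc/PID/oom_score_adj, which ranges from -1000 to 1000; -1000 exempts the process, and writes to the legacy oom_adj file are translated onto this scale. A minimal sketch:

echo -1000 > /proc/$PID/oom_score_adj   # newer equivalent of oom_adj = -17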

 
Of course, you can also change the kernel's OOM behavior globally via a sysctl: setting vm.panic_on_oom makes the kernel panic on out-of-memory instead of invoking the OOM killer.

 


# sysctl -w vm.panic_on_oom=1
vm.panic_on_oom = 1    # 1: panic on OOM instead of killing processes; 0 (default): OOM killer enabled

# sysctl -p
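
Note that sysctl -w only changes the running kernel; for the setting to survive a reboot (and for sysctl -p to have something to reload), add the line to /etc/sysctl.conf first, for example:

echo 'vm.panic_on_oom = 1' >> /etc/sysctl.conf
sysctl -p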

 
To see the effect of the OOM mechanism, let's run a test.

First, check how much memory the system has: a little over 96 GB (the physical total is slightly larger than the reported value).

 

Next, check which processes are the largest with top. I am currently running only two Java processes, at 4.6 GB each; after that, redis uses 21 MB, the iscsi service 32 MB, gdm 25 MB, and every other process is only a few MB.
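
If you prefer a one-shot listing to interactive top, something like this (assuming the usual procps tools) sorts processes by resident memory:

ps aux --sort=-rss | head -n 10   # largest RSS first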

 

Now I wrote a small C program named bigmem and told it to allocate 85 GB of memory. The effect is obvious: after running it, top shows bigmem in first place, with RES (resident memory) at the full 85 GB.
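
The source of bigmem is not reproduced here; if you only want to repeat the experiment, the stress utility (assuming it is installed) can allocate and hold a comparable amount of anonymous memory:

stress --vm 1 --vm-bytes 85G --vm-keep   # one worker holding 85 GB resident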

 

Keep watching: once bigmem has held steady at 85 GB for a while, the kernel kills it automatically (it was not killed while its memory use was still growing). If you do not want it killed, run:


pgrep -f "bigmem" | while read PID; do echo -17 > /proc/$PID/oom_adj;done

Running the test before and after the command above makes the difference obvious, and that is the kernel's OOM mechanism at work.
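
To watch the kill happen from another terminal, you can follow the kernel log while the test runs (dmesg -w requires a reasonably recent util-linux; tailing the syslog works on older systems):

dmesg -w | grep -i -E 'out of memory|killed process'
tail -f /var/log/messages | grep -i oom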

 

Notes:

1. The OOM-killer heuristics in kernels before 2.6.26 were not very accurate; the 2.6.32 kernel shipped with RHEL 6.x addresses this.

2. Child processes inherit the parent's oom_adj.

3. The OOM killer is not a fix for memory leaks.

4. Sometimes free still shows plenty of memory and yet the OOM killer fires; this can happen when the process needs memory from a particular region of the address space (for example low memory, see below).

 

 

To see which package provides dmesg:

$ rpm -qf /bin/dmesg

-f, --file FILE
Query package owning FILE.

 

=================== Low memory vs. high memory

http://blog.csdn.net/eroswang/article/details/5264667

# free -lm
             total       used       free     shared    buffers     cached
Mem:          5769       5751         17          0          8       5267
Low:           777        760         16          0          0          0
High:         4991       4990          1          0          0          0
-/+ buffers/cache:        475       5293
Swap:         4773          0       4773


======

When the system runs short of memory, the oom-killer kills some processes.

The evidence can be found in /var/log/messages.

 

When this happens, you can also clean up shared memory segments whose nattch count is 0, as sketched below.
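
A rough sketch (inspect the output of ipcs -m by hand before removing anything; the awk column numbers assume the usual key/shmid/owner/perms/bytes/nattch layout):

ipcs -m                                   # list shared memory segments; nattch is the attach count
ipcs -m | awk '$2 ~ /^[0-9]+$/ && $6 == 0 {print $2}' | while read ID; do ipcrm -m "$ID"; done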

 

 

For background on the low/high memory split, see the following (quoted from a Red Hat mailing list thread):

Since this problem seems to popup on different lists, this message has
been cross-posted to the general Red Hat discussion list, the RHEL3
(Taroon) list and the RHEL4 (Nahant) list. My apologies for not having
the time to post this summary sooner.

I would still be banging my head against this problem were it not for
the generous assistance of Tom Sightler <ttsig@xxxxxxxxxxxxx> and Brian
Long <brilong@xxxxxxxxx>.

In general, the out of memory killer (oom-killer) begins killing
processes, even on servers with large amounts (6Gb+) of RAM. In many
cases people report plenty of "free" RAM and are perplexed as to why the
oom-killer is whacking processes. Indications that this has happened
appear in /var/log/messages:
Out of Memory: Killed process [PID] [process name].

In my case I was upgrading various VMware servers from RHEL3 / VMware
GSX to RHEL4 / VMware Server. One of the virtual machines on a server
with 16Gb of RAM kept getting whacked by the oom-killer. Needless to
say, this was quite frustrating.

As it turns out, the problem was low memory exhaustion. Quoting Tom:
"The kernel uses low memory to track allocations of all memory thus a
system with 16GB of memory will use significantly more low memory than a
system with 4GB, perhaps as much as 4 times. This extra pressure
happens from the moment you turn the system on before you do anything at
all because the kernel structures have to be sized for the potential of
tracking allocations in four times as much memory."

You can check the status of low & high memory a couple of ways:

# egrep 'High|Low' /proc/meminfo
HighTotal: 5111780 kB
HighFree: 1172 kB
LowTotal: 795688 kB
LowFree: 16788 kB

# free -lm
             total       used       free     shared    buffers     cached
Mem:          5769       5751         17          0          8       5267
Low:           777        760         16          0          0          0
High:         4991       4990          1          0          0          0
-/+ buffers/cache:        475       5293
Swap:         4773          0       4773

When low memory is exhausted, it doesn't matter how much high memory is
available, the oom-killer will begin whacking processes to keep the
server alive.

There are a couple of solutions to this problem:

If possible, upgrade to 64-bit Linux. This is the best solution because
*all* memory becomes low memory. If you run out of low memory in this
case, then you're *really* out of memory. ;-)

If limited to 32-bit Linux, the best solution is to run the hugemem
kernel. This kernel splits low/high memory differently, and in most
cases should provide enough low memory to map high memory. In most
cases this is an easy fix - simply install the hugemem kernel RPM &
reboot.

If running the 32-bit hugemem kernel isn't an option either, you can try
setting /proc/sys/vm/lower_zone_protection to a value of 250 or more.
This will cause the kernel to try to be more aggressive in defending the
low zone from allocating memory that could potentially be allocated in
the high memory zone. As far as I know, this option isn't available
until the 2.6.x kernel. Some experimentation to find the best setting
for your environment will probably be necessary. You can check & set
this value on the fly via:
# cat /proc/sys/vm/lower_zone_protection
# echo "250" > /proc/sys/vm/lower_zone_protection

To set this option on boot, add the following to /etc/sysctl.conf:
vm.lower_zone_protection = 250

As a last-ditch effort, you can disable the oom-killer. This option can
cause the server to hang, so use it with extreme caution (and at your
own risk)!
Check status of oom-killer:
# cat /proc/sys/vm/oom-kill

Turn oom-killer off/on:
# echo "0" > /proc/sys/vm/oom-kill
# echo "1" > /proc/sys/vm/oom-kill

To make this change take effect at boot time, add the following
to /etc/sysctl.conf:
vm.oom-kill = 0

For processes that would have been killed, but weren't because the oom-
killer is disabled, you'll see the following message
in /var/log/messages:
"Would have oom-killed but /proc/sys/vm/oom-kill is disabled"

Sorry for being so long-winded. I hope this helps others who have
struggled with this problem.