Nagios process monitoring and event handlers in practice

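Nagios itself has no monitoring functionality built in; to monitor anything we must install plugins. Fortunately, installing the one general-purpose plugin package now covers most monitoring needs.
The steps for installing the plugins are below; a normal installation has usually done this already.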
cd ~/downloads
wget http://osdn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.7.tar.gz
tar xzf nagios-plugins-1.4.7.tar.gz
cd nagios-plugins-1.4.7
./configure --with-nagios-user=nagios --with-nagios-group=nagios
make
make install

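After this, a libexec directory appears under /usr/local/nagios/, containing many bundled plugin programs. Our own custom plugins can go in this directory too.
Commonly used ones include check_http, check_disk, and so on; to see how one is used, run it with -h, for example ./check_http -h.
What I want to do this time is monitor whether a particular process still exists and, if it does not, run a shell script to start it. This uses check_procs and an event handler. It differs slightly from the Apache status check in the event handler example, so I am writing it down to share with anyone who needs it.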

First, because check_procs has to inspect processes, we make root its owner and set its permissions:
chown root check_procs
chmod 555 check_procs
Next, run:
./check_procs
(the output should look like: PROCS OK: XX processes)
I ran ps aux|grep cypress to find the process I want to monitor; its command line contains /usr/local/cypressTemp/javasdk/bin/java
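The lookup above can also be turned into a number, which is closer to what check_procs -a works with. The following is only a rough sketch: the helper name count_matching is made up for illustration, and it counts lines on standard input that contain a literal string, so it approximates the plugin's matching and does not reproduce it.

```shell
#!/bin/sh
# count_matching PATTERN
# Hypothetical helper: counts the lines on stdin that contain PATTERN
# as a literal (fixed) string and prints the count.
count_matching() {
    # -F: treat the pattern as a plain string, not a regular expression
    # -c: print the number of matching lines instead of the lines
    # grep exits non-zero when the count is 0, so "|| true" keeps the
    # function from failing in that case.
    grep -F -c -- "$1" || true
}

# Possible use against the live process table:
#   ps -eo args | count_matching '/usr/local/cypressTemp/javasdk/bin/java'
# Caveat: the grep process carries the pattern in its own arguments and
# may appear in the snapshot, so the count can be one too high.
```

Feeding it captured ps output instead of a live pipeline avoids the self-match caveat noted in the comment.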

Then use ./check_procs -h to see how check_procs is used. Here are a few of its examples.
Examples:
check_procs -w 2:2 -c 2:1024 -C portsentry
Warning if not two processes with command name portsentry.
Critical if < 2 or > 1024 processes

check_procs -w 10 -a '/usr/local/bin/perl' -u root
Warning alert if > 10 processes with command arguments containing
'/usr/local/bin/perl' and owned by root

check_procs -w 50000 -c 100000 --metric=VSZ
Alert if vsz of any processes over 50K or 100K

check_procs -w 10 -c 20 --metric=CPU
Alert if cpu of any processes over 10% or 20%

Based on these examples, I can write the service and command definitions.
define service{
        host_name               localhost
        service_description     check-cypress
        check_command           check_cypress
        event_handler           restart-cypress
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        }
define command{
        command_name    check_cypress
        command_line    $USER1$/check_procs -c 1:1 -a '/usr/local/cypressTemp/javasdk/bin/java'
        }
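check_procs -c 1:1 -a '/usr/local/cypressTemp/javasdk/bin/java' means: if the number of processes whose arguments contain '/usr/local/cypressTemp/javasdk/bin/java' is not exactly one (for example, none is running), the state is critical (c).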
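To make the 1:1 range concrete, here is a minimal sketch of the comparison. The function name check_range is made up for illustration, and it only handles the fully written min:max form, whereas the real plugin accepts more threshold forms. The status text is modelled on the PROCS OK line shown earlier and is not guaranteed to match the plugin's exact wording. The return values follow the usual plugin convention of 0 for OK and 2 for CRITICAL.

```shell
#!/bin/sh
# check_range COUNT MIN:MAX
# Hypothetical helper: prints a status line and returns 0 (OK) when
# MIN <= COUNT <= MAX, or 2 (CRITICAL) when COUNT is outside the range.
check_range() {
    count=$1
    min=${2%%:*}   # text before the colon
    max=${2##*:}   # text after the colon
    if [ "$count" -lt "$min" ] || [ "$count" -gt "$max" ]; then
        echo "PROCS CRITICAL: $count processes"
        return 2
    fi
    echo "PROCS OK: $count processes"
    return 0
}

# With the range 1:1 only a count of exactly 1 is OK:
#   check_range 0 1:1   -> CRITICAL (process is gone)
#   check_range 1 1:1   -> OK
#   check_range 2 1:1   -> CRITICAL (a duplicate was started)
```

One consequence of choosing 1:1 over 1: is that a second copy of the process also raises CRITICAL, which matters if the restart script is ever run twice.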
Because we also need to act on this state, we use an event handler.
define command{
    command_name    restart-cypress
    command_line    /usr/local/nagios/libexec/eventhandlers/restart-cypress $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    }
Then go to /usr/local/nagios/libexec/eventhandlers/ and edit the restart-cypress file as follows:

#!/bin/sh
#
# Event handler script for restarting the cypress server on the local machine
#
#

# What state is the cypress service in?
/usr/bin/printf "We enter\n" >> /usr/local/nagios/var/Cypress.log
yourdate=$(date +%Y%m%d%H%M%S)
case "$1" in
OK)

    # The service just came back up, so don't do anything...
    /usr/bin/printf "check OK $yourdate\n" >> /usr/local/nagios/var/Cypress.log
    ;;
CRITICAL)
    # Aha! The cypress service appears to have a problem - perhaps we should restart the server...
    /usr/bin/printf "The cypress service appears to have a problem $yourdate\n" >> /usr/local/nagios/var/Cypress.log
    /usr/local/cypressTemp/runsearch.sh
    ;;
esac
exit 0
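The script above receives $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as its second and third arguments but only looks at the first. Nagios runs event handlers for soft problem states as well as for the first hard one, so this script may call the restart on each failed retry. One possible refinement, loosely following the pattern of the Apache example in the Nagios documentation, is to restart only on the last soft retry and on the hard state. This is a sketch only: decide_action is a made-up name, and the attempt number 3 assumes the max_check_attempts value of 4 used in this article.

```shell
#!/bin/sh
# decide_action STATE STATETYPE ATTEMPT
# Hypothetical helper: prints "restart" when the handler should act,
# otherwise "none".
decide_action() {
    case "$1" in
    CRITICAL)
        case "$2" in
        HARD)
            # All retries failed: restart.
            echo restart ;;
        SOFT)
            # Assumes max_check_attempts is 4, so attempt 3 is the
            # last soft retry before the state turns HARD.
            if [ "$3" -eq 3 ]; then echo restart; else echo none; fi ;;
        *)
            echo none ;;
        esac ;;
    *)
        # OK, WARNING, UNKNOWN: nothing to do.
        echo none ;;
    esac
}

# Inside the handler one might then write (restart path from the article):
#   if [ "$(decide_action "$1" "$2" "$3")" = restart ]; then
#       /usr/local/cypressTemp/runsearch.sh
#   fi
```

Because the check uses the range 1:1, waiting until the third attempt also lowers the chance of starting a second copy while the first is still coming up.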

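The shell file above is very simple: it just runs a shell script that has already been written. The example will be easy to follow if you have studied the event handler example on www.nagios.org; I wrote out the whole process only to offer a practical example for reference.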

posted on 2012-05-09 16:56 by 草原和大树