Custom graphs in Ganglia

http://blog.163.com/digoal@126/blog/static/1638770402014815105622529/

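The previous post briefly covered gweb's aggregate and compare views.

Those views were created ad hoc. gweb also supports named custom views, defined in configuration files, so that they can be reused at any time.

They work much like the aggregate and compare views; the difference is that they are saved under a label and can be opened whenever needed.

Rules for creating a view: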

Defining views using JSON

Views are stored as JSON files in the conf_dir directory. The default for conf_dir is /var/lib/ganglia/conf. You can change that by specifying an alternate directory in conf.php:

$conf['conf_dir'] = "/var/www/html/conf";

First, work out where the configuration files live.

The current configuration is:

[root@db-172-16-3-221 etc]# cd /data01/web/ganglia-web/

[root@db-172-16-3-221 ganglia-web]# less conf.php

# Where to store web-based configuration

$conf['gweb_confdir'] = "/data01/web/ganglia-web";

$conf['views_dir'] = $conf['gweb_confdir'] . '/conf';

$conf['conf_dir'] = $conf['gweb_confdir'] . '/conf';

So the view configuration files are stored in /data01/web/ganglia-web/conf.

Next, the naming rule:

The file name is view_?.json, where ? is the view name and must be unique.

You can create or edit existing files. The filename for the view must start with view_ and end with .json (as in view_1.json or view_jira_servers.json). It must be unique. Here is an example definition of a view that results in a view with three different graphs:

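Sample configuration file (the # annotations are explanatory only; JSON has no comment syntax, so strip them from a real file):

view_jira.json

{
  "view_name":"jira",  # name of the view
  "items":[  # contents of the view, given as a list
    { "hostname":"web01.domain.com","graph":"cpu_report"},  # graph 1: one graph for one host; the graph itself is defined in graph.d
    { "hostname":"web02.domain.com","graph":"load_report"},  # graph 2: one graph for one host; the graph itself is defined in graph.d
    { "hostname":"web03.domain.com","metric":"cpu_aidle"},  # graph 3: one metric for one host
    { "aggregate_graph":"true",  # graph 4: an aggregate graph
      "host_regex":[  # list of host regular expressions
        {"regex":"web[2-7]"},
        {"regex":"web50"}
      ],
      "metric_regex":[  # list of metric regular expressions
        {"regex":"load_one"}
      ],
      "graph_type":"stack",  # drawing style of the aggregate graph: line or stack
      "title":"Location Web Servers load"  # title of the aggregate graph
    },
    {....}  # graph 5, .....
  ],  # end of the list
  "view_type":"standard"  # type of the view
}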
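As a rough illustration of the naming and layout rules above, the following Python sketch builds a view definition and derives its file name. The helper names (`view_filename`, `build_view`) and the restriction on allowed characters are assumptions made for this example; only the JSON keys and host names come from the template above.

```python
import json
import re


def view_filename(view_name):
    """Return the file name expected for a view: view_<name>.json."""
    # Assumption for this sketch: restrict names to filesystem-safe characters.
    if not re.fullmatch(r"[A-Za-z0-9_]+", view_name):
        raise ValueError("unsupported view name: %r" % view_name)
    return "view_%s.json" % view_name


def build_view(view_name, items):
    """Assemble a standard view using the keys shown in the template."""
    return {"view_name": view_name, "items": list(items), "view_type": "standard"}


items = [
    {"hostname": "web01.domain.com", "graph": "cpu_report"},
    {"hostname": "web03.domain.com", "metric": "cpu_aidle"},
    {
        "aggregate_graph": "true",
        "host_regex": [{"regex": "web[2-7]"}, {"regex": "web50"}],
        "metric_regex": [{"regex": "load_one"}],
        "graph_type": "stack",
        "title": "Location Web Servers load",
    },
]

view = build_view("jira", items)
filename = view_filename(view["view_name"])
text = json.dumps(view, indent=2)  # valid JSON, so no comments and no missing brackets
print(filename)
print(text)
```

Generating the file with a JSON library instead of typing it by hand avoids the unbalanced brackets that are easy to introduce in nested host_regex and metric_regex lists.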

Note that the graph names used in items are defined in the graph.d directory:

/data01/web/ganglia-web/graph.d

[root@db-172-16-3-221 graph.d]# ll

total 80

-rw-r--r-- 1 nobody 1000 578 Jan 6 2014 apache_report.json

-rw-r--r-- 1 nobody 1000 186 Jan 6 2014 apache_response_report.json

-rw-r--r-- 1 nobody 1000 426 Jan 6 2014 cpu_report.json

-rw-r--r-- 1 nobody 1000 9065 Jan 6 2014 cpu_report.php

-rw-r--r-- 1 nobody 1000 476 Jan 6 2014 load_all_report.json

-rw-r--r-- 1 nobody 1000 590 Jan 6 2014 load_report.json

-rw-r--r-- 1 nobody 1000 680 Jan 6 2014 mem_report.json

-rw-r--r-- 1 nobody 1000 8324 Aug 4 03:04 mem_report.php

-rw-r--r-- 1 nobody 1000 4973 Jan 6 2014 metric.php

-rw-r--r-- 1 nobody 1000 355 Jan 6 2014 network_report.json

-rw-r--r-- 1 nobody 1000 1616 Jan 6 2014 nfs_v3_client_report.json

-rw-r--r-- 1 nobody 1000 358 Jan 6 2014 packet_report.json

-rw-r--r-- 1 nobody 1000 6392 Jan 6 2014 sample_report.php

-rw-r--r-- 1 nobody 1000 3543 Jan 6 2014 varnish_report.php

For example:

[root@db-172-16-3-221 graph.d]# cat cpu_report.json

{

"report_name" : "cpu_report",

"report_type" : "template",

"title" : "CPU Report",

"graphite" : "target=alias(HOST_CLUSTER.cpu_user.sum,'User')&target=alias(HOST_CLUSTER.cpu_nice.sum%2C'Nice')&target=alias(HOST_CLUSTER.cpu_system.sum,'System')&target=alias(HOST_CLUSTER.cpu_wio.sum,'Wait')&target=alias(HOST_CLUSTER.cpu_idle.sum%2C'Idle')&areaMode=stacked&max=100&colorList=3333bb,ffea00,dd0000,ff8a60,e2e2f2"

}
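The "graphite" value above is an ordinary URL query string, which is easier to read once it is split into its parameters. The sketch below does this with Python's standard library on a shortened copy of the string; it only illustrates the format and is not part of gweb.

```python
from urllib.parse import parse_qs

# Shortened copy of the "graphite" value from cpu_report.json above.
graphite = (
    "target=alias(HOST_CLUSTER.cpu_user.sum,'User')"
    "&target=alias(HOST_CLUSTER.cpu_nice.sum%2C'Nice')"
    "&areaMode=stacked&max=100&colorList=3333bb,ffea00"
)

params = parse_qs(graphite)   # maps each key to a list of decoded values
targets = params["target"]    # one entry per plotted series
for target in targets:
    print(target)
print(params["areaMode"], params["max"], params["colorList"])
```

Note that %2C is simply an encoded comma, so the two target spellings in the file are equivalent once decoded.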

[root@db-172-16-3-221 graph.d]# cat cpu_report.php

<?php

/* Pass in by reference! */

function graph_cpu_report( &$rrdtool_graph )

{

global $conf,

$context,

$range,

$rrd_dir,

$size;

if ($conf['strip_domainname']) {

$hostname = strip_domainname($GLOBALS['hostname']);

} else {

$hostname = $GLOBALS['hostname'];

}

$title = 'CPU';

$rrdtool_graph['title'] = $title;

$rrdtool_graph['upper-limit'] = '100';

$rrdtool_graph['lower-limit'] = '0';

$rrdtool_graph['vertical-label'] = 'Percent';

$rrdtool_graph['height'] += ($size == 'medium') ? 28 : 0;

$rrdtool_graph['extras'] = ($conf['graphreport_stats'] == true) ? ' --font LEGEND:7' : '';

$rrdtool_graph['extras'] .= " --rigid";

if ( $conf['graphreport_stats'] ) {

$rrdtool_graph['height'] += ($size == 'medium') ? 16 : 0;

$rmspace = '\\g';

} else {

$rmspace = '';

}

$series = '';

// RB: Perform some formatting/spacing magic.. tinkered to fit

$eol1 = '';

$space1 = '';

$space2 = '';

if ($size == 'small') {

$eol1 = '\\l';

$space1 = ' ';

$space2 = ' ';

} else if ($size == 'medium' || $size == 'default') {

$eol1 = '';

$space1 = ' ';

$space2 = '';

} else if ($size == 'large') {

$eol1 = '';

$space1 = ' ';

$space2 = ' ';

}

$cpu_nice_def = '';

$cpu_nice_cdef = '';

if (file_exists("$rrd_dir/cpu_nice.rrd")) {

$cpu_nice_def = "DEF:'cpu_nice'='${rrd_dir}/cpu_nice.rrd':'sum':AVERAGE ";

$cpu_nice_cdef = "CDEF:'ccpu_nice'=cpu_nice,num_nodes,/ ";

}

if ($context != "host" ) {

$series .= "DEF:'num_nodes'='${rrd_dir}/cpu_user.rrd':'num':AVERAGE ";

}

$series .= "DEF:'cpu_user'='${rrd_dir}/cpu_user.rrd':'sum':AVERAGE "

. $cpu_nice_def

. "DEF:'cpu_system'='${rrd_dir}/cpu_system.rrd':'sum':AVERAGE "

. "DEF:'cpu_idle'='${rrd_dir}/cpu_idle.rrd':'sum':AVERAGE ";

if (file_exists("$rrd_dir/cpu_wio.rrd")) {

$series .= "DEF:'cpu_wio'='${rrd_dir}/cpu_wio.rrd':'sum':AVERAGE ";

}

if (file_exists("$rrd_dir/cpu_steal.rrd")) {

$series .= "DEF:'cpu_steal'='${rrd_dir}/cpu_steal.rrd':'sum':AVERAGE ";

}

if (file_exists("$rrd_dir/cpu_sintr.rrd")) {

$series .= "DEF:'cpu_sintr'='${rrd_dir}/cpu_sintr.rrd':'sum':AVERAGE ";

}

if ($context != "host" ) {

$series .= "CDEF:'ccpu_user'=cpu_user,num_nodes,/ "

. $cpu_nice_cdef

. "CDEF:'ccpu_system'=cpu_system,num_nodes,/ "

. "CDEF:'ccpu_idle'=cpu_idle,num_nodes,/ ";

if (file_exists("$rrd_dir/cpu_wio.rrd")) {

$series .= "CDEF:'ccpu_wio'=cpu_wio,num_nodes,/ ";

}

if (file_exists("$rrd_dir/cpu_sintr.rrd")) {

$series .= "CDEF:'ccpu_sintr'=cpu_sintr,num_nodes,/ ";

}

if (file_exists("$rrd_dir/cpu_steal.rrd")) {

$series .= "CDEF:'ccpu_steal'=cpu_steal,num_nodes,/ ";

}

$plot_prefix ='ccpu';

} else {

$plot_prefix ='cpu';

}

$series .= "AREA:'${plot_prefix}_user'#${conf['cpu_user_color']}:'User${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:user_pos=${plot_prefix}_user,0,INF,LIMIT "

. "VDEF:user_last=user_pos,LAST "

. "VDEF:user_min=user_pos,MINIMUM "

. "VDEF:user_avg=user_pos,AVERAGE "

. "VDEF:user_max=user_pos,MAXIMUM "

. "GPRINT:'user_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'user_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'user_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'user_max':'${space1}Max\:%5.1lf%%\\l' ";

}

if (file_exists("$rrd_dir/cpu_nice.rrd")) {

$series .= "STACK:'${plot_prefix}_nice'#${conf['cpu_nice_color']}:'Nice${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:nice_pos=${plot_prefix}_nice,0,INF,LIMIT "

. "VDEF:nice_last=nice_pos,LAST "

. "VDEF:nice_min=nice_pos,MINIMUM "

. "VDEF:nice_avg=nice_pos,AVERAGE "

. "VDEF:nice_max=nice_pos,MAXIMUM "

. "GPRINT:'nice_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'nice_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'nice_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'nice_max':'${space1}Max\:%5.1lf%%\\l' ";

}

}

$series .= "STACK:'${plot_prefix}_system'#${conf['cpu_system_color']}:'System${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:system_pos=${plot_prefix}_system,0,INF,LIMIT "

. "VDEF:system_last=system_pos,LAST "

. "VDEF:system_min=system_pos,MINIMUM "

. "VDEF:system_avg=system_pos,AVERAGE "

. "VDEF:system_max=system_pos,MAXIMUM "

. "GPRINT:'system_last':'${space1}Now\:%5.1lf%%' "

. "GPRINT:'system_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'system_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'system_max':'${space1}Max\:%5.1lf%%\\l' ";

}

if (file_exists("$rrd_dir/cpu_wio.rrd")) {

$series .= "STACK:'${plot_prefix}_wio'#${conf['cpu_wio_color']}:'Wait${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:wio_pos=${plot_prefix}_wio,0,INF,LIMIT "

. "VDEF:wio_last=wio_pos,LAST "

. "VDEF:wio_min=wio_pos,MINIMUM "

. "VDEF:wio_avg=wio_pos,AVERAGE "

. "VDEF:wio_max=wio_pos,MAXIMUM "

. "GPRINT:'wio_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'wio_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'wio_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'wio_max':'${space1}Max\:%5.1lf%%\\l' ";

}

}

if (file_exists("$rrd_dir/cpu_steal.rrd")) {

$series .= "STACK:'${plot_prefix}_steal'#${conf['cpu_steal_color']}:'Steal${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:steal_pos=${plot_prefix}_steal,0,INF,LIMIT "

. "VDEF:steal_last=steal_pos,LAST "

. "VDEF:steal_min=steal_pos,MINIMUM "

. "VDEF:steal_avg=steal_pos,AVERAGE "

. "VDEF:steal_max=steal_pos,MAXIMUM "

. "GPRINT:'steal_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'steal_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'steal_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'steal_max':'${space1}Max\:%5.1lf%%\\l' ";

}

}

if (file_exists("$rrd_dir/cpu_sintr.rrd")) {

$series .= "STACK:'${plot_prefix}_sintr'#${conf['cpu_sintr_color']}:'Sintr${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:sintr_pos=${plot_prefix}_sintr,0,INF,LIMIT "

. "VDEF:sintr_last=sintr_pos,LAST "

. "VDEF:sintr_min=sintr_pos,MINIMUM "

. "VDEF:sintr_avg=sintr_pos,AVERAGE "

. "VDEF:sintr_max=sintr_pos,MAXIMUM "

. "GPRINT:'sintr_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'sintr_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'sintr_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'sintr_max':'${space1}Max\:%5.1lf%%\\l' ";

}

}

$series .= "STACK:'${plot_prefix}_idle'#${conf['cpu_idle_color']}:'Idle${rmspace}' ";

if ( $conf['graphreport_stats'] ) {

$series .= "CDEF:idle_pos=${plot_prefix}_idle,0,INF,LIMIT "

. "VDEF:idle_last=idle_pos,LAST "

. "VDEF:idle_min=idle_pos,MINIMUM "

. "VDEF:idle_avg=idle_pos,AVERAGE "

. "VDEF:idle_max=idle_pos,MAXIMUM "

. "GPRINT:'idle_last':' ${space1}Now\:%5.1lf%%' "

. "GPRINT:'idle_min':'${space1}Min\:%5.1lf%%${eol1}' "

. "GPRINT:'idle_avg':'${space2}Avg\:%5.1lf%%' "

. "GPRINT:'idle_max':'${space1}Max\:%5.1lf%%\\l' ";

}

// If metrics like cpu_user and wio are not present we are likely not collecting them on this

// host therefore we should not attempt to build anything and will likely end up with a broken

// image. To avoid that we'll make an empty image

if ( !file_exists("$rrd_dir/cpu_wio.rrd") && !file_exists("$rrd_dir/cpu_user.rrd") )

$rrdtool_graph[ 'series' ] = 'HRULE:1#FFCC33:"No matching metrics detected"';

else

$rrdtool_graph[ 'series' ] = $series;

return $rrdtool_graph;

}

For writing a custom ?_report.json, see:

http://blog.163.com/digoal@126/blog/static/163877040201481535636763/

The JSON file of a view supports the following keys:

Key: Value

view_name: Name of the view, which must be unique.

view_type: Standard or Regex. Regex view allows you to specify regex to match hosts.

items: An array of hashes describing which metrics should be part of the view.

hostname: Hostname of the host that we want metric/graph displayed.

metric: Name of the metric, such as load_one.

graph: Graph name, such as cpu_report or load_report. You can use metric or graph keys but not both.

aggregate_graph: If this value exists and is set to true, the item defines an aggregate graph.

This item needs a hash of regular expressions and a description.

warning: (Optional) Adds a vertical yellow line to provide visual cue for a warning state.

critical: (Optional) Adds a vertical red line to provide visual cue for a critical state.
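The key rules listed above can be turned into a small sanity check. The sketch below is a hypothetical helper, not something shipped with gweb; it assumes that view_type is written in lower case ("standard" or "regex", as in the sample files) and that the "description" required for an aggregate graph is its title key.

```python
def check_view(view):
    """Return a list of problems found in a view definition (empty if none)."""
    problems = []
    if not view.get("view_name"):
        problems.append("view_name is missing")
    # Assumption: lower-case spellings, as used in the sample files.
    if view.get("view_type") not in ("standard", "regex"):
        problems.append("view_type should be 'standard' or 'regex'")
    for i, item in enumerate(view.get("items", [])):
        if item.get("aggregate_graph") == "true":
            # Aggregate items need the regular expressions and a title.
            for key in ("host_regex", "metric_regex", "title"):
                if key not in item:
                    problems.append("item %d: aggregate graph lacks %s" % (i, key))
        elif ("metric" in item) == ("graph" in item):
            # Both present or both absent: the rule is metric OR graph, not both.
            problems.append("item %d: give exactly one of metric or graph" % i)
        elif "hostname" not in item:
            problems.append("item %d: hostname is missing" % i)
    return problems


good = {"view_name": "default", "items": [], "view_type": "standard"}
bad = {
    "view_name": "broken",
    "view_type": "standard",
    "items": [{"hostname": "web01", "metric": "load_one", "graph": "load_report"}],
}
print(check_view(good))
print(check_view(bad))
```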

Example:

# cd /data01/web/ganglia-web/conf

[root@db-172-16-3-221 conf]# ll

-rw-r--r-- 1 nobody 1000 58 Jan 6 2014 view_default.json

[root@db-172-16-3-221 conf]# cat view_default.json

{"view_name":"default","items":[],"view_type":"standard"}

view_default.json is a view whose items list is empty.

Let's change its configuration:

{
  "view_name":"default",
  "items":[
    { "aggregate_graph":"true",
      "host_regex":[
        {"regex":".*"}
      ],
      "metric_regex":[
        {"regex":"^load.*"}
      ],
      "graph_type":"line",
      "title":"Location Web Servers load"
    }
  ],
  "view_type":"standard"
}

(The host regex .* matches every host, which is good enough here because this setup has only one host.)

The change takes effect immediately.

As another example, add a view_digoal.json:

[root@db-172-16-3-221 conf]# vi view_digoal.json

{
  "view_name":"digoal",
  "items":[
    { "aggregate_graph":"true",
      "host_regex":[
        {"regex":".*"}
      ],
      "metric_regex":[
        {"regex":"^cpu.*"}
      ],
      "graph_type":"line",
      "title":"Cpu"
    },
    { "aggregate_graph":"true",
      "host_regex":[
        {"regex":".*"}
      ],
      "metric_regex":[
        {"regex":"^mem.*"}
      ],
      "graph_type":"line",
      "title":"Mem"
    }
  ],
  "view_type":"standard"
}

This again shows why consistent naming rules for gmond hostnames and metric names matter.
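To see why naming matters, it helps to check what a given pair of regular expressions would select. The sketch below assumes the patterns are applied as unanchored searches (an assumption; the exact matching done by gweb is not shown in this post), and the host and metric names are invented for the example.

```python
import re


def select(names, patterns):
    """Return the names matched by at least one pattern (unanchored search)."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in names if any(c.search(n) for c in compiled)]


# Hypothetical host and metric names.
hosts = ["web2", "web7", "web50", "db01"]
metrics = ["cpu_user", "cpu_idle", "mem_free", "load_one", "proc_cpu_time"]

web_hosts = select(hosts, ["web[2-7]", "web50"])
cpu_metrics = select(metrics, ["^cpu.*"])
print(web_hosts)
print(cpu_metrics)
```

With a consistent prefix such as cpu_ or mem_, one anchored pattern selects exactly the intended metrics; a metric like proc_cpu_time is left out only because the pattern starts with ^.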

http://blog.csdn.net/cloudeep/article/details/5669295

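Extending Ganglia with Python

-- Authors: Terry, Schubert

1. Introduction to Ganglia

Ganglia is an open source monitoring project started at UC Berkeley and designed to measure thousands of nodes. Each machine runs a daemon called gmond that collects and sends metric data (such as processor speed and memory usage), gathered from the operating system and the specified hosts. Hosts that receive all the metric data can display it and can pass a condensed form of it up a hierarchy. This hierarchical design is what lets Ganglia scale well. gmond adds very little system load, so it can run on every machine in a cluster without affecting user performance.

All of this data collection can, at times, affect node performance. Network "jitter" occurs when a large number of small messages arrive at the same moment. We found that keeping the node clocks in lockstep avoids this problem.

2. Ganglia's extension mechanisms

A basic Ganglia installation already provides a lot of useful information. Ganglia plug-ins give us two ways to add more:

- by adding in-band plug-ins;
- by adding out-of-band spoofing from some other source.

Installing and starting Ganglia is covered by the references at the end of this document. This document focuses on the in-band approach, implemented as a Python plug-in.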

3. System preparation

Test environment:

- Machine:
  - Model: DELL OPTIPLEX 755
  - Operating system: Linux 2.6.18-164.15.1.el5.centos.plus #1 SMP Wed Mar 17 19:54:20 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
  - Memory: 2 GB
- Ganglia deployment:
  - Ganglia root directory: /usr/local/ganglia/ ($GANGLIA_ROOT)
  - Ganglia configuration directory: /etc/ganglia/ ($GANGLIA_CONF)
  - Ganglia RRDTool directory: /var/lib/ganglia/rrds/ ($GANGLIA_RRDS)
  - Ganglia html directory: /var/www/html/ganglia/ ($GANGLIA_WEB)

For brevity, the $GANGLIA_* names are used below to refer to these locations.

If the installation succeeded, $GANGLIA_ROOT/lib64/ganglia/modpython.so exists. This file is the shared library behind Ganglia's Python extension support; without it, Python extensions cannot be loaded.

4. Implementing the Python extension

1) The example

We implement a module named random_module that provides two metrics, random1 and random2. Both take values in [RandomMin, RandomMax], and the two are complementary, that is:

random2 = RandomMin + RandomMax - random1
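A quick check of this relation, as a sketch with RandomMin = 0 and RandomMax = 10 (the values used later in the .pyconf file): because random1 lies in the range, its complement does too, and the two always add up to RandomMin + RandomMax. The function name is made up for the example.

```python
import random

RANDOM_MIN, RANDOM_MAX = 0, 10


def complementary_pair(rng):
    """Draw random1 and derive random2 from it."""
    r1 = rng.randint(RANDOM_MIN, RANDOM_MAX)   # inclusive on both ends
    r2 = RANDOM_MIN + RANDOM_MAX - r1
    return r1, r2


rng = random.Random(42)
pairs = [complementary_pair(rng) for _ in range(1000)]
print(all(RANDOM_MIN <= r2 <= RANDOM_MAX for _, r2 in pairs))
print(all(r1 + r2 == RANDOM_MIN + RANDOM_MAX for r1, r2 in pairs))
```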

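2) What needs to be done

Implementing the module involves the following steps:

- edit the configuration files to register the extension module;
- write the Python code of the extension module;
- add a report graph for the module (optional).

3) Editing the configuration files

- gmond.conf

Edit $GANGLIA_CONF/gmond.conf as follows: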

*********************** Start ****************************

…

modules {
  …..
  module {
    name = "sys_module"
    path = "modsys.so"
  }
  /* Python master module */
  module {
    name = "python_module"
    /* Shared library; the full path is $GANGLIA_ROOT/lib64/ganglia/modpython.so */
    path = "modpython.so"
    /* Directory holding the Python module code; create it if it does not exist */
    params = "/etc/ganglia/python_modules/"
  }
}

include ('/etc/ganglia/conf.d/*.conf')

/* /etc/ganglia/conf.d/ holds the configuration files of the Python modules; create it if it does not exist. At startup gmond loads every configuration file and every Python module. */
include ('/etc/ganglia/conf.d/*.pyconf')

….

*********************** End ****************************

Below, $GANGLIA_PY_CODE and $GANGLIA_PY_CONF stand for the directories that hold the Python module code and the module configuration files.

- random_module.pyconf

vi $GANGLIA_PY_CONF/random_module.pyconf

The contents of random_module.pyconf:

*********************** Start ****************************

modules {
  module {
    /* Module name $PY_MODULE; the Python file must be $GANGLIA_PY_CODE/$PY_MODULE.py */
    name = "random_module"
    language = "python"
    /* Parameters. They are all passed as one dict (map) to the script's metric_init(params).
       In this example metric_init is called with params = {"RandomMax": "10", "RandomMin": "0"} */
    param RandomMax {
      value = 10
    }
    param RandomMin {
      value = 0
    }
  }
}

/* Metrics to collect. A module may provide any number of metrics; here we collect random1 and random2.
   The title is used as the heading of the metric's graph. */
collection_group {
  /* Reporting schedule. Available settings:
     collect_once   - marks the group as static metrics that are collected once
     collect_every  - collection interval (only valid for non-static metrics)
     time_threshold - maximum interval between sends */
  collect_every = 10 /* collect every 10 s */
  time_threshold = 50
  metric {
    name = "random1" /* Metric name (see "gmond -m") */
    title = "test random1"
    value_threshold = 50 /* Metric variance threshold (send if exceeded) */
  }
  metric {
    name = "random2"
    title = "test random2"
    value_threshold = 50
  }
}

*********************** End ****************************
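Because a .pyconf file is mostly nested braces, it is easy to leave a block or comment unclosed. As a sketch, the following Python function (a hypothetical helper, not part of Ganglia) renders a file with the same layout as random_module.pyconf: a modules block that declares the module and its parameters, followed by a top-level collection_group.

```python
def render_pyconf(module, params, metrics, collect_every=10, time_threshold=50):
    """Return the text of a .pyconf file for one Python module."""
    lines = [
        "modules {",
        "  module {",
        '    name = "%s"' % module,
        '    language = "python"',
    ]
    for key, value in params.items():
        lines += ["    param %s {" % key, "      value = %s" % value, "    }"]
    lines += ["  }", "}", ""]
    lines += [
        "collection_group {",
        "  collect_every = %d" % collect_every,
        "  time_threshold = %d" % time_threshold,
    ]
    for name, title, value_threshold in metrics:
        lines += [
            "  metric {",
            '    name = "%s"' % name,
            '    title = "%s"' % title,
            "    value_threshold = %d" % value_threshold,
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


pyconf_text = render_pyconf(
    "random_module",
    {"RandomMax": 10, "RandomMin": 0},
    [("random1", "test random1", 50), ("random2", "test random2", 50)],
)
print(pyconf_text)
```

Counting opening and closing braces in the result is a cheap way to catch the kind of unbalanced block that stops gmond from parsing the file.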

4) Writing the module code

A Python script that extends Ganglia mainly has to implement these functions:

- metric_init(params):
  - Called once at module initialization time
  - Must return a metric description dictionary or list of dictionaries
  - Any other module initialization can also take place here

  Metric definition data dictionary:

  d = {'name': '<your_metric_name>',
       'call_back': <call_back function>,
       'time_max': int(<your_time_max>),
       'value_type': '<string | uint | float | double>',
       'units': '<your_units>',
       'slope': '<zero | positive | negative | both>',
       'format': '<your_format>',
       'description': '<your_description>'}

  Can be a single dictionary or a list of dictionaries
  Must be returned from the metric_init() function

- metric_handler(): may have multiple handlers
  - Metric gathering handler
  - Must return a single data value of the same type as specified in the metric_init() function

- metric_cleanup()
  - Called once at module termination time
  - Does not return a value

vi $GANGLIA_PY_CODE/random_module.py

The code of random_module.py:

*********************** Start ****************************

import random

# Defaults; metric_init() overrides them with the values from the .pyconf file.
random_max = 100
random_min = 0
v = 0  # last value produced for random1


def random1_handler(name):
    """Return a new random value and remember it for random2."""
    global v
    v = random.randint(random_min, random_max)
    return v


def random2_handler(name):
    """Return the complement of the last random1 value."""
    return random_min + random_max - v


def metric_init(params):
    """Called once at initialization; returns the list of metric descriptors."""
    global random_max, random_min
    if params:
        # Parameter values arrive as strings, so convert them.
        if "RandomMin" in params:
            random_min = int(params["RandomMin"])
        if "RandomMax" in params:
            random_max = int(params["RandomMax"])
    tmp = {'name': 'random1', 'call_back': random1_handler,
           'value_type': 'uint', 'units': 'usage',
           'slope': 'both', 'format': '%u',
           'description': 'test random plugin',
           'groups': 'random'}
    descriptors = [tmp]
    tmp1 = {'name': 'random2', 'call_back': random2_handler,
            'value_type': 'uint', 'units': 'usage',
            'slope': 'both', 'format': '%u',
            'description': 'test subs plugin',
            'groups': 'random'}
    descriptors.append(tmp1)
    return descriptors


def metric_cleanup():
    """Called once at termination; nothing to release here."""
    pass


# Stand-alone test: run the file directly to print one value per metric.
if __name__ == '__main__':
    descriptors = metric_init(None)
    for d in descriptors:
        print("value for %s is %d" % (d['name'], d['call_back'](d['name'])))

*********************** End ****************************
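Before restarting gmond it can be useful to exercise a module outside of it. The sketch below is a hypothetical test harness: it checks that every descriptor carries the keys used by the example module and calls each call_back once. The list of keys treated as required and the stand-in metric_init are choices made for this illustration, not rules taken from Ganglia.

```python
# Keys used by the descriptors in random_module.py (treated as required here).
REQUIRED_KEYS = ("name", "call_back", "value_type", "units",
                 "slope", "format", "description")


def check_descriptors(descriptors):
    """Validate descriptors and return {metric name: sampled value}."""
    if isinstance(descriptors, dict):
        descriptors = [descriptors]   # metric_init may return a single dict
    results = {}
    for d in descriptors:
        missing = [k for k in REQUIRED_KEYS if k not in d]
        if missing:
            raise KeyError("%s lacks %s" % (d.get("name"), ", ".join(missing)))
        results[d["name"]] = d["call_back"](d["name"])
    return results


# Stand-in module so that the sketch runs on its own.
def demo_handler(name):
    return 7


def metric_init(params):
    return [{"name": "demo", "call_back": demo_handler, "time_max": 90,
             "value_type": "uint", "units": "count", "slope": "both",
             "format": "%u", "description": "stand-in metric"}]


values = check_descriptors(metric_init({}))
print(values)
```

To test a real module, import it and pass the result of its own metric_init to check_descriptors in place of the stand-in.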

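After these steps, restart gmond. The new random1 and random2 graphs then appear in the host view of the web interface.

Screenshot:

5) Adding a report graph

The extension above adds custom metrics, each shown in its own graph. When troubleshooting, it helps to combine several metrics in a single graph, which requires adding a report graph. This only involves changes to Ganglia's web code. Using the random module as the example, we plot random1 and random2 in the same graph.

The files under $GANGLIA_WEB are described in Appendix 1.

What needs to be done:

- edit $GANGLIA_WEB/conf.php and add the name of the report graph, $GRAPH_NAME ("random" in this example);

- write $GANGLIA_WEB/graph.d/{$GRAPH_NAME}_report.php, implementing the function:

function graph_{$GRAPH_NAME}_report ( &$rrdtool_graph )

- restart the httpd service.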

a) Edit $GANGLIA_WEB/conf.php

vi $GANGLIA_WEB/conf.php

************************* Start **********************

…

# Colors for the load ranks.

$load_colors = array(

"100+" => "ff634f",

"75-100" =>"ffa15e",

"50-75" => "ffde5e",

"25-50" => "caff98",

"0-25" => "e2ecff",

"down" => "515151"

);

# Colors for our custom metrics; they could also be defined in random_report.php

# Colors for the random report graph

$random1_color = "0000FF";

$random2_color = "FF0000";

…

# Default metric

$default_metric = "load_one";

# Optional summary graphs

#$optional_graphs = array('packet');

# Name of the report graph to add

$optional_graphs = array('random');

…

************************ End *************************

b) Write $GANGLIA_WEB/graph.d/random_report.php

The reports already present in $GANGLIA_WEB/graph.d/ are a useful reference.

vi $GANGLIA_WEB/graph.d/random_report.php

************************ Start *************************

<?php
function graph_random_report ( &$rrdtool_graph ) {
    global $context,        // import the globals we need
           $random1_color,
           $random2_color,
           $hostname,
           $range,
           $rrd_dir,        // i.e. $GANGLIA_RRDS
           $size;
    //
    // You *MUST* set at least the 'title', 'vertical-label', and 'series' variables.
    // Otherwise, the graph *will not work*.
    //
    $title = 'random';
    if ($context != 'host') {                       // is this the host view?
        $rrdtool_graph['title'] = $title;           // title of the graph
    } else {
        $rrdtool_graph['title'] = "$hostname $title last $range";
    }
    $rrdtool_graph['vertical-label'] = 'usage';     // label of the Y axis
    // Fudge to account for number of lines in the chart legend
    $rrdtool_graph['height'] += ($size == 'medium') ? 28 : 0;
    $rrdtool_graph['upper-limit'] = '10';
    $rrdtool_graph['lower-limit'] = '0';
    $rrdtool_graph['extras'] = '--rigid';
    // Every fragment must end with a space, otherwise two rrdtool
    // commands run together when the strings are concatenated.
    if ($context != "host") {
        $series = "DEF:'num_nodes'='${rrd_dir}/cpu_user.rrd':'num':AVERAGE "
                . "DEF:'random1'='${rrd_dir}/random1.rrd':'sum':AVERAGE "
                . "CDEF:'crandom1'=random1,num_nodes,/ "
                . "DEF:'random2'='${rrd_dir}/random2.rrd':'sum':AVERAGE "
                . "CDEF:'crandom2'=random2,num_nodes,/ "
                . "LINE2:'crandom1'#$random1_color:'random1' "
                . "LINE2:'crandom2'#$random2_color:'random2' ";
    } else {
        $series = "DEF:'random1'='${rrd_dir}/random1.rrd':'sum':AVERAGE "
                . "DEF:'random2'='${rrd_dir}/random2.rrd':'sum':AVERAGE "
                . "LINE2:'random1'#$random1_color:'random1' "
                . "LINE2:'random2'#$random2_color:'random2' ";
    }
    // We have everything now, so add it to the array, and go on our way.
    $rrdtool_graph['series'] = $series;
    return $rrdtool_graph;
}
?>

************************ End *************************

The relevant notes, excerpted from "Custom Graphs in Ganglia 3.1.x":

Set values for the following hash keys:

`$rrdtool_graph['title']`	This will be used as the "title" of the graph.
`$rrdtool_graph['vertical-label']`	This will set the label for the Y-axis on the chart (there is no corresponding X-axis key, since it is always "time").
`$rrdtool_graph['series']`	Commands to actually generate the `rrdtool` graph are here. This is a string variable, so care must be taken to properly format and space each command. Developers may find it useful to create a temporary array and `push()` commands into it, then `implode()` them into a single string.
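The advice about collecting commands in an array and joining them at the end can be sketched as follows. The example is written in Python for illustration only (in the report file itself the equivalent would be a PHP array and implode()); the function name and the RRD directory are made up, while the DEF and LINE2 fragments mirror the ones used in random_report.php.

```python
def build_series(rrd_dir, metrics):
    """Build the 'series' string from (metric name, hex color) pairs."""
    parts = []
    for name, _color in metrics:
        # One data definition per metric.
        parts.append("DEF:'%s'='%s/%s.rrd':'sum':AVERAGE" % (name, rrd_dir, name))
    for name, color in metrics:
        # One 2-pixel line per metric.
        parts.append("LINE2:'%s'#%s:'%s'" % (name, color, name))
    # Joining with a single space guarantees a separator between commands.
    return " ".join(parts)


series = build_series(
    "/var/lib/ganglia/rrds/cluster/host",   # hypothetical directory
    [("random1", "0000FF"), ("random2", "FF0000")],
)
print(series)
```

Joining a list removes the need to remember a trailing space on every fragment, which is the most common way to end up with two commands fused together.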

Set other variables as desired. Setting the "upper-limit" and "lower-limit" keys can be useful to clamp a chart to a fixed range in the Y-axis, for example when you are monitoring a percentage and always want the low and high values to be 0 and 100, respectively. This can also be used to ignore values outside the norm that would otherwise cause rrdtool to choose an inappropriate range; basically, cheap spike removal.

The most difficult part of generating the graph is properly setting the $rrdtool_graph['series'] value. It is suggested that you experiment first on the command line, using rrdtool directly, then convert that into a set of PHP statements. The PHP code can be as simple or complicated as required.

Many variables are pre-defined and available for use in custom graphs. Users are encouraged to make use of these, although changing their values inside the report PHP file is not recommended. Variables are imported into the scope of the PHP file using the PHP global keyword (yes, it's ugly, we know). The more commonly used variables are:

$context	e.g. "host", "cluster", "meta", etc.
$hostname	set to the current hostname, as known to gmetad
$range	time range of the graph, usually "hour", "week", "month", or "year"
$rrd_dir	appropriate filesystem directory for the RRD file in question. Different contexts (host/cluster/meta) will be handled correctly. Use this instead of hardcoding paths to RRD files.
$size	current size of the graph, usually "small", "medium", "large", etc.
$cpu_*_color	see the list in conf.php
$load_one_color
$load_colors	assoc. array that stores the colors for the load ranks. Valid keys are: "down", "0-25", "25-50", "50-75", "75-100", "100+"
$mem_*_color	similar to $cpu_*_color, above
$strip_domainname	should the "shortname" of the host be used, instead of the FQDN?

Penultimately, any value in the $rrdtool_graph['extras'] key will be passed, verbatim, to rrdtool after all other keys, but before various data definition, calculation, graphing and printing elements. This key is essentially a way for the developer to add any other rrdtool options that are desired, and make a last-ditch effort to override other settings.

And lastly, the $rrdtool_graph variable should be the return value of the function; Ganglia will take care of the rest!


Keys present in the $rrdtool_graph associative array.

A list of keys in $rrdtool_graph that are used:

$series (string: holds the meat of the rrdgraph definition. REQUIRED!)

// see the reference "Introduction to RRD Databases and RRDTool"

    $title           (string: title of the report. REQUIRED!) 

    $vertical_label  (label for Y-Axis. REQUIRED!)



    $start           (string: Start time of the graph, can usually be

                              left alone)

    $end             (string: End time of the graph, also can usually be

                              left alone)



    $width           (strings: Width and height of *graph*, the actual image

    $height                    will be slightly larger due to text elements

                               and padding.  These are normally set

                               automatically, depending on the graph size

                               chosen from the web UI)



    $upper-limit     (strings: Maximum and minimum Y-value for the graph.

    $lower-limit               RRDTool normally will auto-scale the Y min

                               and max to fit the data.  You may override

                               this by setting these variables to specific

                               limits.  The default value is a null string,

                               which will force the auto-scale behavior)



    $color           (array: Sets one or more chart colors.  Usually used

                             for setting the background color of the chart.

                             Valid array keys are BACK, CANVAS, SHADEA,

                             SHADEB, FONT, FRAME and ARROW.  Usually,

                             only BACK is set, and only rarely at that.)



    $extras          (Any other custom rrdtool commands can be added to this

                      variable.  For example, setting a different --base

                      value or use a --logarithmic scale)



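Once this is done, restart httpd:

service httpd restart

The random report graph now appears in the cluster view. Screenshot:

It does not appear in the host view, however, even though $GANGLIA_WEB/graph.d/random_report.php handles both the cluster context and the host context.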

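6) A custom Ganglia web template

Tracing how the web front end builds a page, we found that neither $GANGLIA_WEB/host_view.php nor $GANGLIA_WEB/templates/default/host_view.tpl contains any code for displaying custom report graphs. To leave the default template untouched, we create a new template directory, $GANGLIA_WEB/templates/onest, and copy the contents of default into it.

a) Edit $GANGLIA_WEB/host_view.php

Just before the "$tpl->printToScreen();" statement at the end of the file, add the following code:

// Add the custom report graphs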

if (!isset($optional_graphs))

$optional_graphs = array();

foreach ($optional_graphs as $g) {

$tpl->newBlock('optional_graphs');

$tpl->assign('name',$g);

$tpl->assign("cluster_url", $cluster_url);

$tpl->assign("graphargs", "h=$hostname&$get_metric_string&st=$cluster[LOCALTIME]");

$tpl->gotoBlock('_ROOT');

}

This works by filling in the optional_graphs block of $GANGLIA_WEB/templates/onest/host_view.tpl.

b) Edit $GANGLIA_WEB/templates/onest/host_view.tpl

**************** Start ****************************

…….

<IMG BORDER=0 ALT="{cluster_url} NETWORK"

SRC="./graph.php?g=network_report&z=medium&c={cluster_url}&{graphargs}">

</A>

<IMG BORDER=0 ALT="{cluster_url} PACKETS"

SRC="./graph.php?g=packet_report&z=medium&c={cluster_url}&{graphargs}">

</A>

<IMG BORDER=0 ALT="{cluster_url} {name}"

SRC="./graph.php?g={name}_report&z=medium&c={cluster_url}&{graphargs}">

</A>

…..

************************* End ****************************

These changes add the packet report and the custom report graphs to the host view.

c) Edit $GANGLIA_WEB/conf.php

Set $template_name so that it points to our template directory:

**************** Start ****************************

<?php

# $Id: conf.php.in 1688 2008-08-15 12:34:40Z carenas $

# Gmetad-webfrontend version. Used to check for updates.

include_once "./version.php";

# The name of the directory in "./templates" which contains the

# templates that you want to use. Templates are like a skin for the

# site that can alter its look and feel.

$template_name = "onest";

…..

************************* End ****************************

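d) Restart the httpd service

Run service httpd restart. The random report of a node is now visible in the host view.

Screenshot:

e) Customizing the display

If you do not like Ganglia's default look, edit the corresponding template files under $GANGLIA_WEB/templates/$template_name.

Appendix 1: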

$GANGLIA_WEB

|-- AUTHORS

|-- COPYING

|-- Makefile.am

|-- auth.php

|-- class.TemplatePower.inc.php

|-- cluster_legend.html

|-- cluster_view.php // cluster view

|-- conf.php

|-- conf.php.in // initial template for the configuration file

|-- footer.php // footer

|-- functions.php

|-- ganglia.php

|-- get_context.php // determines the view type

|-- get_ganglia.php

|-- graph.d // graphing scripts for metrics and report graphs

| |-- cpu_report.php

| |-- load_report.php

| |-- mem_report.php

| |-- metric.php

| |-- network_report.php

| |-- packet_report.php

| |-- random_report.php

| `-- sample_report.php

|-- graph.php // entry point that invokes the graphing scripts

|-- grid_tree.php

|-- header.php

|-- host_view.php // host view

|-- index.php

|-- meta_view.php

|-- node_legend.html

|-- physical_view.php

|-- pie.php

|-- private_clusters

|-- show_node.php

|-- styles.css

|-- templates // templates; edit these to customize the display

| |-- default // default template directory

| | |-- cluster_extra.tpl // extra content for the cluster view

| | |-- cluster_view.tpl // cluster view template

| | |-- footer.tpl

| | |-- grid_tree.tpl

| | |-- header-nobanner.tpl

| | |-- header.tpl

| | |-- host_extra.tpl // extra content for the host view

| | |-- host_view.tpl // host view template

| | |-- images

| | | |-- cluster_0-24.jpg

| | | |-- cluster_25-49.jpg

| | | |-- cluster_50-74.jpg

| | | |-- cluster_75-100.jpg

| | | |-- cluster_overloaded.jpg

| | | |-- cluster_private.jpg

| | | |-- grid_0-24.jpg

| | | |-- grid_25-49.jpg

| | | |-- grid_50-74.jpg

| | | |-- grid_75-100.jpg

| | | |-- grid_overloaded.jpg

| | | |-- grid_private.jpg

| | | |-- logo.jpg

| | | |-- node_0-24.jpg

| | | |-- node_25-49.jpg

| | | |-- node_50-74.jpg

| | | |-- node_75-100.jpg

| | | |-- node_dead.jpg

| | | `-- node_overloaded.jpg

| | |-- meta_view.tpl

| | |-- node_extra.tpl

| | |-- physical_view.tpl

| | `-- show_node.tpl

| `-- onest // custom template directory, selected through $template_name in conf.php

| |-- cluster_extra.tpl

| |-- cluster_view.tpl

| |-- footer.tpl

| |-- grid_tree.tpl

| |-- header-nobanner.tpl

| |-- header.tpl

| |-- host_extra.tpl

| |-- host_view.tpl

| |-- images

| | |-- cluster_0-24.jpg

| | |-- cluster_25-49.jpg

| | |-- cluster_50-74.jpg

| | |-- cluster_75-100.jpg

| | |-- cluster_overloaded.jpg

| | |-- cluster_private.jpg

| | |-- grid_0-24.jpg

| | |-- grid_25-49.jpg

| | |-- grid_50-74.jpg

| | |-- grid_75-100.jpg

| | |-- grid_overloaded.jpg

| | |-- grid_private.jpg

| | |-- logo.jpg

| | |-- node_0-24.jpg

| | |-- node_25-49.jpg

| | |-- node_50-74.jpg

| | |-- node_75-100.jpg

| | |-- node_dead.jpg

| | `-- node_overloaded.jpg

| |-- meta_view.tpl

| |-- node_extra.tpl

| |-- physical_view.tpl

| `-- show_node.tpl

|-- version.php

`-- version.php.in

References:

1) "Custom Graphs in Ganglia 3.1.x"

http://sourceforge.net/apps/trac/ganglia/wiki/Custom_graphs

2) "Ganglia Monitoring Tool"

http://www.slideshare.net/sudhirpg/ganglia-monitoring-tool

3) "Developing custom modules for ganglia 3.1.1"

http://yaoweibin2008.blog.163.com/blog/static/11031392009085410345/

4) "Ganglia and Nagios, Part 1: Monitor enterprise clusters with Ganglia"

http://www.ibm.com/developerworks/cn/linux/l-ganglia-nagios-1/

5) PHP syntax reference

http://www.w3school.com.cn/php/php_looping.asp

6) Introduction to RRD Databases and RRDTool

http://linux.chinaunix.net/salon/200712/files/RRD_RRDTool_xa.pdf

http://www.seotcs.com/blog/1304.html

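Monitoring a large website is systematic work that touches every part of the system, and doing it thoroughly is not easy. In the author's view, web monitoring should cover the following areas:

1. Basic server resources. This covers the use of memory, CPU, disk and similar resources on each server.

Monitoring these resources has an obvious payoff: resource-hungry programs can be found and fixed quickly, and hardware bottlenecks are spotted early, so that engineers can expand or upgrade the hardware promptly.

2. Web servers. This is mainly about request monitoring, for example the number of concurrent httpd connections and the average response time.

3. Web databases. This covers the resource use of the database processes, the most frequently executed SQL, the most expensive SQL statements, and so on. Most major database products, such as Oracle and MySQL, ship with monitoring features.

4. Application servers. This can cover Web Services and similar components: how important APIs are called, how many times, and how often per unit of time.

5. Network traffic. Analysing bandwidth use is just as important. Traffic monitoring reveals abnormal traffic, such as a DDoS attack, early enough to react before the attack takes the site down.

Monitoring is complex, but there are plenty of good open source tools available for download. Some of the heavyweight ones are introduced below.

1. Ganglia

Ganglia is a cross-platform, scalable distributed monitoring system for high-performance computing systems. It is an open source monitoring project started at UC Berkeley and designed to measure thousands of nodes. Each machine runs a daemon called gmond that collects and sends metric data (such as processor speed and memory usage), gathered from the operating system and the specified hosts. Hosts that receive all the metric data can display it and can pass a condensed form of it up a hierarchy. This hierarchical design is what lets Ganglia scale well. gmond adds very little system load, so it can run on every machine in a cluster without affecting user performance.

2. Munin

Munin collects data through a client-server architecture and turns it into graphs. It lets you track the behaviour of your hosts, called "nodes", and send the records to a central server, where they can be displayed as graphs.

3. Cacti

"Cacti" is the English plural of cactus. Cacti is a network traffic monitoring and graphing tool built on PHP, MySQL, SNMP and RRDTool. It fetches data with snmpget and draws graphs with RRDtool, so you do not need to understand RRDtool's complex parameters. Cacti provides powerful data and user management: you can specify which tree views, hosts and individual graphs each user may see, it can authenticate users against LDAP, and you can add your own templates.

4. Nagios

Nagios is a powerful monitoring system that bills itself as the industry standard in IT infrastructure monitoring. It lets organizations identify and resolve problems in their IT infrastructure promptly. Its main capabilities are:

1) monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP, FTP, SSH);

2) monitoring of host resources (CPU load, disk usage, system logs, etc.);

3) parallelized service checks;

4) alert notifications (by SMS, email or other user-defined methods);

5) user-definable event handlers;

6) an optional browser-based web interface for viewing network status, system problems, logs, etc.;

7) access to monitoring information from a mobile phone.

In summary, web monitoring matters enough that companies should invest real effort in it. Fortunately there are many monitoring tools; web engineers and operations staff need to learn them in order to contribute more to the monitoring work.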

http://www.haodaima.net/art/1186169

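Ganglia monitoring system: custom plugins

2011/12/30 17:49:42

Once Ganglia is running in production, the built-in plugins may no longer be enough, and you will want to report performance data for specific services. Below I walk through a complete "Hello World" custom plugin, for future reference.

The plugin is written in Python; the test environment is Ubuntu 10.04 on x86_64.

Preparing the environment

apt-get install ganglia-monitor

Create a conf.d directory under /etc/ganglia/.

In conf.d, create modpython.conf with the following content:

modules {
  module {
    name = "python_module"
    path = "/usr/lib/ganglia/modpython.so"
    params = "/usr/lib/ganglia/python_modules"
  }
}

include('/etc/ganglia/conf.d/*.pyconf')

cd into /usr/lib/ganglia and create a python_modules directory there. Every Python script you write later must be placed in python_modules.

The basic environment is now ready.

Custom plugin

Writing a custom plugin takes two steps: first write the Python script, then register it with gmond.

(1) Step one: write the Python script

The script must contain the following three functions: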

metric_init(params):

Called once at module initialization time
Must return a metric description dictionary or list of dictionaries

Metric definition data dictionary:

d = {
    'name': '<your_metric_name>',
    'call_back': <call_back function>,
    'time_max': int(<your_time_max>),
    'value_type': '<string | uint | float | double>',
    'units': '<your_units>',
    'slope': '<zero | positive | negative | both>',
    'format': '<your_format>',
    'description': '<your_description>'
}

Can be a single dictionary or a list of dictionaries. Must be returned from the metric_init() function.

metric_handler(): (may have multiple handlers)

Metric gathering handler
Must return a single data value of the same type as specified in the metric_init() function

metric_cleanup()

Called once at module termination time
Does not return a value

Here is a simple example. Go into the python_modules directory mentioned above and create a file such as test123.py:

#!/usr/bin/env python
import random

descriptors = []


def get_foo_count(name):
    """Callback: return a pseudo-random sample value for the metric."""
    return random.randrange(23, 90) + 5


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    global descriptors

    d1 = {
        'name': 'test_count',
        'call_back': get_foo_count,
        'time_max': 90,  # scheduling interval
        'value_type': 'uint',
        'units': 'C',
        'slope': 'both',
        'format': '%u',
        'description': 'Number of test',
        'groups': 'test_group'
    }
    descriptors = [d1]
    return descriptors


def metric_cleanup():
    pass


if __name__ == '__main__':
    # Stand-alone test: run "python test123.py" before handing the module to gmond.
    metric_init({})
    for d in descriptors:
        v = d['call_back'](d['name'])
        print('value for %s is %u' % (d['name'], v))
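Before registering a module with gmond it can help to check its descriptors offline. The following is a minimal sketch, not part of Ganglia: the names EXPECTED_KEYS, check_descriptors and dry_run are hypothetical helpers, and the key list simply mirrors the descriptor dictionary shown above. Whether gmond treats every one of those keys as mandatory is not something this sketch asserts.

```python
# Hypothetical offline checker for gmond Python module descriptors.
# EXPECTED_KEYS mirrors the descriptor dictionary documented above.
EXPECTED_KEYS = ("name", "call_back", "time_max", "value_type",
                 "units", "slope", "format", "description")


def check_descriptors(descriptors):
    """Return a list of problems found in the descriptors (empty if none)."""
    if isinstance(descriptors, dict):
        descriptors = [descriptors]  # metric_init may return a single dict
    problems = []
    for d in descriptors:
        name = d.get("name", "<unnamed>")
        for key in EXPECTED_KEYS:
            if key not in d:
                problems.append("%s: missing key '%s'" % (name, key))
        if "call_back" in d and not callable(d["call_back"]):
            problems.append("%s: call_back is not callable" % name)
    return problems


def dry_run(descriptors):
    """Call every callback once with the metric name and collect the values."""
    return {d["name"]: d["call_back"](d["name"]) for d in descriptors}


if __name__ == "__main__":
    sample = [{
        "name": "test_count", "call_back": lambda name: 42, "time_max": 90,
        "value_type": "uint", "units": "C", "slope": "both",
        "format": "%u", "description": "Number of test",
    }]
    print(check_descriptors(sample))
    print(dry_run(sample))
```

In practice you would import your own module and pass the result of its metric_init({}) to both helpers.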

(2) Register the script

cd /etc/ganglia/conf.d/

Create test123.pyconf with the following content:

modules {
  module {
    name = "test123"
    language = "python"
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "test_count" /* Metric name (see "gmond -m") */
    title = "test data"
    value_threshold = 50 /* Metric variance threshold (send if exceeded) */
  }
}

Restart gmond; the corresponding graph should then appear in the web interface.

Notes

You may run into permission problems while setting this up. When debugging, tail -f /var/log/syslog can help track them down.

http://dudushunai2008like.lofter.com/post/373687_1082aea

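[ganglia] Extending Ganglia with gmetric and Python

From: bkeep

1 Overview

Ganglia's extension capabilities:

Ganglia plugins give us two ways to add functionality:

Method 1: add in-band plugins, using the gmetric command.

Method 2: add out-of-band spoofing from some other source.

Plugins can also be implemented through the C or Python interface.

Note: Hadoop uses the metrics2 plugin, which deserves separate study.

2 The gmetric command in detail

2.1 Example 1:

#gmetric -n test_string -v 'hello value' -t string -d 10 -c /etc/ganglia/gmond.conf.bkeep -S '2.2.2.2:web'

Explanation: -n 'metric name', -v metric value, -t data type, -u 'units', -d lifetime of the metric, -c the Ganglia configuration file to use, -S spoofed client information (2.2.2.2 is the IP address, web is the host name).

2.2 Example 2:

#gmetric -n bkeepUse -v 10 -t int32 -u '% test' -d 10 -S '2.2.2.2:web'

2.3 gmetric syntax:

#gmetric --help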

gmetric 3.1.2

Purpose:

The Ganglia Metric Client (gmetric) announces a metric

on the list of defined send channels defined in a configuration file

Usage: gmetric [OPTIONS]...

-h, --help Print help and exit

-V, --version Print version and exit

-c, --conf=STRING The configuration file to use for finding send channels

(default='/etc/ganglia/gmond.conf')

-n, --name=STRING Name of the metric

-v, --value=STRING Value of the metric

-t, --type=STRING Either

-u, --units=STRING Unit of measure for the value e.g. Kilobytes, Celcius

(default='')

-s, --slope=STRING Either zero|positive|negative|both (default='both')

-x, --tmax=INT The maximum time in seconds between gmetric calls

(default='60')

-d, --dmax=INT The lifetime in seconds of this metric (default='0')

-S, --spoof=STRING IP address and name of host/device (colon separated) we

are spoofing (default='')

-H, --heartbeat spoof a heartbeat message (use with spoof option)
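When metrics are pushed from a script rather than typed by hand, the gmetric command line can be assembled programmatically. The sketch below is an illustration only: build_gmetric_cmd and send_metric are hypothetical helper names, and the only flags used are ones that appear in the help output above (-n, -v, -t, -u, -d, -c, -S). It assumes the gmetric binary is on PATH when send_metric is called.

```python
import subprocess


def build_gmetric_cmd(name, value, value_type, units="", dmax=0,
                      conf=None, spoof=None):
    """Build the argument list for one gmetric call."""
    cmd = ["gmetric", "-n", name, "-v", str(value), "-t", value_type]
    if units:
        cmd += ["-u", units]
    if dmax:
        cmd += ["-d", str(dmax)]      # lifetime of the metric in seconds
    if conf:
        cmd += ["-c", conf]           # alternate gmond configuration file
    if spoof:
        cmd += ["-S", spoof]          # "ip:hostname" of the host being spoofed
    return cmd


def send_metric(**kwargs):
    """Run gmetric; raises CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_gmetric_cmd(**kwargs))


if __name__ == "__main__":
    # Rebuilds the command from example 2 without executing it.
    print(build_gmetric_cmd("bkeepUse", 10, "int32", units="% test",
                            dmax=10, spoof="2.2.2.2:web"))
```

Passing the arguments as a list avoids shell quoting problems with values such as '% test'.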

3 Extending Ganglia with Python plugins

3.1 First install the Ganglia Python module, modpython.so

Package:

#rpm -qf /usr/lib64/ganglia/modpython.so

ganglia-gmond-python-3.1.2-5.el5

Installed files:

[root@inc-dw-hadoop-3 /usr/lib64/ganglia]

#ls

modcpu.so modload.so modmulticpu.so modproc.so modsys.so

moddisk.so modmem.so modnet.so modpython.so python_modules

3.2 Configure gmond.conf to load the extension module

1. vi /etc/ganglia/gmond.conf

# add the following line if it is missing

include ('/etc/ganglia/conf.d/*.conf')

2. Create modpython.conf

vi /etc/ganglia/conf.d/modpython.conf

# The modules section describes the module
# that should be loaded.
# name - module name
# path - load path of the .so
# params - path to the directory where mod_python
# should look for python metric modules

modules {
  module {
    name = "python_module" # the main Python module
    path = "modpython.so" # path to the shared library
    params = "/usr/lib64/ganglia/python_modules" # directory that holds our Python scripts
  }
}

include ('/etc/ganglia/conf.d/*.pyconf') # only one file in this directory may contain this include; otherwise gmond loops forever and can bring the machine down

3.3 Hands-on: developing a Python module

3.3.1 Test environment

Process   Host
--------  ----------------
gmond     inc-dw-hadoop-3
gmetad    dw-ganglia-3
httpd     dw-ganglia-3

3.3.2 Create the module configuration file random_module.pyconf:

Note: do not confuse this with modpython.conf; a custom module's .pyconf file applies only to the module you wrote.

[root@inc-dw-hadoop-3 /etc/ganglia/conf.d]

#cat random_module.pyconf

modules {
  module {
    # module name; the script lives in the directory given by params = "/usr/lib64/ganglia/python_modules"
    name = "random_module"
    # this module is written in Python
    language = "python"
    # parameter list; all params are passed as one dict to the script's metric_init(params)
    # in this example metric_init is called with params = {"RandomMax": "10", "RandomMin": "0"}
    param RandomMax {
      value = 10
    }
    param RandomMin {
      value = 0
    }
  }
}

# metrics to collect; one module can provide any number of metrics
collection_group {
  collect_every = 10
  time_threshold = 50 # maximum interval between sends
  metric {
    name = "random1" # the metric's name inside the module
    title = "test random1" # title shown in the web interface
    value_threshold = 50
  }
  metric {
    name = "random2"
    title = "test random2"
    value_threshold = 50
  }
}

3.3.3 Write the random_module.py module:

What exactly gets collected, and what are the syntax rules? Read the example below carefully and you will see.

#cd /usr/lib64/ganglia/python_modules/

#vi random_module.py

import random

random_max = 100
random_min = 0
v = 0


def random1_handler(name):
    """Return a random integer in [random_min, random_max] and remember it."""
    global v
    v = random.randint(random_min, random_max)
    return v


def random2_handler(name):
    """Return the mirror image of the last random1 value within the range."""
    return random_min + random_max - v


def metric_init(params):
    """Read RandomMin/RandomMax from params and return both descriptors."""
    global random_max, random_min
    if params:
        # Values arrive as strings from the .pyconf file, hence int().
        if "RandomMin" in params:
            random_min = int(params["RandomMin"])
        if "RandomMax" in params:
            random_max = int(params["RandomMax"])

    tmp = {'name': 'random1', 'call_back': random1_handler,
           'value_type': 'uint', 'units': 'usage',
           'slope': 'both', 'format': '%u',
           'description': 'test random plugin',
           'groups': 'random'}
    descriptors = [tmp]

    tmp1 = {'name': 'random2', 'call_back': random2_handler,
            'value_type': 'uint', 'units': 'usage',
            'slope': 'both', 'format': '%u',
            'description': 'test subs plugin',
            'groups': 'random'}
    descriptors.append(tmp1)
    return descriptors


def metric_cleanup():
    pass


if __name__ == '__main__':
    descriptors = metric_init(None)
    for d in descriptors:
        print("value for %s is %d" % (d['name'], d['call_back'](d['name'])))

3.3.4 Metric development interface:

When extending Ganglia with a module, the Python script mainly needs to implement these functions:

metric_init(params):

- Called once at module initialization time

- Must return a metric description dictionary or list of dictionaries

Metric definition data dictionary:

d = {'name': '<your_metric_name>',

'call_back': <call_back function>,

'time_max': int(<your_time_max>),

'value_type': '<string | uint | float | double>',

'units': '<your_units>',

'slope': '<zero | positive | negative | both>',

'format': '<your_format>',

'description': '<your_description>'}

Can be a single dictionary or a list of dictionaries

Must be returned from the metric_init() function

- Any other module initialization can also take place here

metric_handler() (may have multiple handlers)

- Metric gathering handler

- Must return a single data value of the same type as specified in the metric_init() function

metric_cleanup()

- Called once at module termination time

- Does not return a value

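3.3.5 Output of random_module.py:

After completing the steps above, restart gmond; the new random1 and random2 graphs then appear in the node view of the web interface.

3.4 Hands-on: adding a report graph (several curves on one graph)

3.4.1 Approach

Approach: find the keyword in the URL, search the ganglia web directory for files that mention it, then imitate them.

Note: the memory and network templates differ and need to be imitated separately.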

[root@dw-ganglia-3 /usr/share/ganglia]

# grep "network_report" * -r

get_context.php: "network_report" => 1,

graph.php:$graph = isset($_GET["g"]) && in_array( $_GET['g'], array( 'cpu_report', 'mem_report', 'load_report', 'network_report', 'packet_report' ) ) ?

graph.php: else if ($graph == "network_report")

templates/default/host_view.tpl:<A HREF="./graph.php?g=network_report&z=large&c={cluster_url}&{graphargs}">

templates/default/host_view.tpl: SRC="./graph.php?g=network_report&z=medium&c={cluster_url}&{graphargs}">

templates/default/cluster_view.tpl:<A HREF="./graph.php?g=network_report&z=large&{graph_args}">

templates/default/cluster_view.tpl: SRC="./graph.php?g=network_report&z=medium&{graph_args}">

3.4.2 Edit get_context.php

[root@dw-ganglia-3 /usr/share/ganglia]

#vi get_context.php

#add by bkeep

$reports = array(

"load_report" => "load_one",

"cpu_report" => 1,

"mem_report" => 1,

"network_report" => 1,

"random_report" => 1,

"packet_report" => 1

);

3.4.3 Edit graph.php

This shows how to add the random curves; adding more curves is shown later.

[root@dw-ganglia-3 /usr/share/ganglia]

#vi graph.php

#$graph = isset($_GET["g"]) && in_array( $_GET['g'], array( 'cpu_report', 'mem_report', 'load_report', 'network_report', 'packet_report' ) ) ?

#modify by bkeep

$graph = isset($_GET["g"]) && in_array( $_GET['g'], array( 'cpu_report', 'mem_report', 'load_report', 'network_report', 'packet_report' ,'random_report') ) ?

# add by bkeep

#else if ($graph == "network_report")

else if ($graph == "random_report")

{

$fudge = $fudge_2;

#$style = "Network";

$style = "Random";

$lower_limit = "--lower-limit 0 --rigid";

$extras = "--base 1024";

$vertical_label = "--vertical-label 'Bytes/sec'";

#$series = "DEF:'bytes_in'='${rrd_dir}/bytes_in.rrd':'sum':AVERAGE "

$series = "DEF:'random1'='${rrd_dir}/random1.rrd':'sum':AVERAGE "

#."DEF:'bytes_out'='${rrd_dir}/bytes_out.rrd':'sum':AVERAGE "

."DEF:'random2'='${rrd_dir}/random2.rrd':'sum':AVERAGE "

#."LINE2:'bytes_in'#$mem_cached_color:'In' "

."LINE2:'random1'#$mem_cached_color:'In' "

#."LINE2:'bytes_out'#$mem_used_color:'Out' ";

."LINE2:'random2'#$mem_used_color:'Out' ";

}

3.4.4 Edit templates/default/host_view.tpl

[root@dw-ganglia-3 /usr/share/ganglia]

#vi templates/default/host_view.tpl

<IMG BORDER=0 ALT="{cluster_url} NETWORK"

SRC="./graph.php?g=network_report&z=medium&c={cluster_url}&{graphargs}">

</A>

<IMG BORDER=0 ALT="{cluster_url} NETWORK"

SRC="./graph.php?g=random_report&z=medium&c={cluster_url}&{graphargs}">

</A>

3.4.5 Edit templates/default/cluster_view.tpl

[root@dw-ganglia-3 /usr/share/ganglia]

#vi templates/default/cluster_view.tpl

<IMG BORDER=0 ALT="{cluster} NETWORK"

SRC="./graph.php?g=network_report&z=medium&{graph_args}">

</A>

<IMG BORDER=0 ALT="{cluster} NETWORK"

SRC="./graph.php?g=random_report&z=medium&{graph_args}">

</A>

3.4.6 Adding more curves

vi /usr/share/ganglia/graph.php

# add by bkeep

#else if ($graph == "network_report")

else if ($graph == "random_report")

{

$fudge = $fudge_2;

#$style = "Network";

$style = "Random";

$lower_limit = "--lower-limit 0 --rigid";

$extras = "--base 1024";

$vertical_label = "--vertical-label 'Bytes/sec'";

#$series = "DEF:'bytes_in'='${rrd_dir}/bytes_in.rrd':'sum':AVERAGE "

$series = "DEF:'random1'='${rrd_dir}/random1.rrd':'sum':AVERAGE "

#."DEF:'bytes_out'='${rrd_dir}/bytes_out.rrd':'sum':AVERAGE "

."DEF:'random2'='${rrd_dir}/random2.rrd':'sum':AVERAGE "

."DEF:'bytes_out'='${rrd_dir}/bytes_out.rrd':'sum':AVERAGE "

."DEF:'bytes_in'='${rrd_dir}/bytes_in.rrd':'sum':AVERAGE " # define a data series

#."LINE2:'bytes_in'#$mem_cached_color:'In' "

."LINE2:'random1'#$mem_cached_color:'random1' "

#."LINE2:'bytes_out'#$mem_used_color:'Out' ";

."LINE2:'random2'#$mem_used_color:'random2' "

."LINE2:'bytes_out'#$mem_used_color:'bytes_out' "

."LINE2:'bytes_in'#$cpu_nice_color:'bytes_in' "; # set the line colour

}

The resulting graph shows four curves.
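The $series string in graph.php follows a regular pattern: one DEF per RRD file, then one LINE2 per curve. As an illustration of that pattern only (the PHP above is what gweb actually runs), here is a small Python sketch that generates the same kind of string for any number of metrics. build_series, the example directory and the hex colours are assumptions made for this sketch.

```python
def build_series(rrd_dir, metrics):
    """Return the DEF/LINE2 part of an rrdtool graph command.

    metrics is a list of (metric_name, hex_colour, legend) tuples.
    All DEF entries come first, then all LINE2 entries, as in graph.php.
    """
    defs = ["DEF:'%s'='%s/%s.rrd':'sum':AVERAGE" % (name, rrd_dir, name)
            for name, _colour, _legend in metrics]
    lines = ["LINE2:'%s'#%s:'%s'" % (name, colour, legend)
             for name, colour, legend in metrics]
    return " ".join(defs + lines) + " "


if __name__ == "__main__":
    # Hypothetical RRD directory and colours, for illustration.
    print(build_series("/var/lib/ganglia/rrds/cluster/host", [
        ("random1", "0000FF", "random1"),
        ("random2", "FF0000", "random2"),
    ]))
```

Giving every curve its own colour avoids two lines being drawn in the same colour, which is easy to do by accident when the list is maintained by hand.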

2014-04-08

http://segmentfault.com/q/1010000000116157

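As the number of servers grows, along with the number of independently running service processes and the communication between them, what should be monitored and how? Are there any mature approaches?
Can monitoring be split into two areas: 1) system-level monitoring (CPU, memory, I/O, disk, network) and whether services are alive;
2) application-level monitoring (business-related exceptions in each subsystem)?
Concretely, how do you implement this as a flexibly configurable, extensible, plugin-based monitoring platform? It still feels tricky.

Having considered everyone's answers, I plan to start like this:
1: Nagios for CPU, memory, disk and other basic non-business monitoring
2: Each business module does its own monitoring: service exceptions, service statistics, etc.
1) Service exceptions are sent asynchronously over a message queue to the main monitoring server, which handles them centrally
2) Service statistics are first aggregated in the module's local memory, then sent at fixed intervals to the main monitoring server for persistence and further processing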

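6 answers

Accepted answer

Ajian, answered 2012-09-06, updated 2012-09-07

These are just thoughts written down as they came to mind.
By scope, monitoring splits into system-level and business-logic monitoring. Open-source tools generally target the system and software level and cannot monitor business logic. Business-logic monitoring differs from application to application, so developers need to expose interfaces for it, and operations can raise this as a requirement.
By function, monitoring splits into alerting and performance monitoring. For alerting, Nagios, as everyone has said, is very good open-source software; it is really a monitoring framework, which makes it flexible. Performance monitoring, with tools such as Cacti and Ganglia, is mainly for watching trends, which helps you locate problems or catch them early, since alert thresholds often need repeated tuning before they are right.
The choice of tool generally depends on how your servers are distributed:
If you have many distributed data centers, centralized monitoring and handling matter a lot. Ganglia is distributed by design and is the first choice; Nagios needs some plugin optimization and structural changes to support distributed setups well. The challenges of a distributed setup are centralized management and reliability. Reliability: problems in network transmission must not distort the monitoring, or it will not be accurate. Centralized management: it reduces the workload.
If the servers are in one place and there are many of them, I still recommend Ganglia; for a small number, many other tools will do. For alerting, stay with Nagios, since few tools are as flexible, but be sure to adapt the configuration to your environment and keep it as simple and quick as possible. Defining some rules of your own helps.
As for supporting tools around monitoring, such as SMS and email alerts, it is better to build some tools yourself, all to keep alerting reliable. Early on, keep checking that monitoring keeps up with requirements and expect to make many adjustments; setting it up is not the end of the job.

Comments on your plan
Having considered everyone's answers, I plan to start like this:
1: Nagios for CPU, memory, disk and other basic non-business monitoring
# Nagios can in fact monitor business logic too. First you need to know which business logic to monitor and whether the program exposes an interface for it; if not, whether one can be added; then write the corresponding scripts. Both Nagios and Ganglia make scripting easy. What matters most is the monitoring requirements and how well the program supports them.
2: Each business module does its own monitoring: service exceptions, service statistics, etc.
1) Service exceptions are sent asynchronously over a message queue to the main monitoring server, which handles them centrally
# I take it you mean writing your own monitor and sending the results to the main server through a queue. Within a single data center it is better to write Nagios plugins: management stays unified and you only write plugins. If the data centers are distributed, consider scripts that pass messages between Nagios instances. Writing everything yourself costs time and leaves management fragmented.
2) Service statistics are first aggregated in the module's local memory, then sent at fixed intervals to the main monitoring server for persistence and further processing
# I suggest splitting this into two parts. The first is basic server information such as CPU, memory and disk, which does not change and can be collected at long intervals; Ganglia already has all the system hardware information by default, though it is less convenient if you want to compare it in a table. By contrast, system users, disk capacity, configuration files, cron jobs, enabled services and startup items can be collected on a schedule; this is really a form of backup. Once all configuration is managed centrally, with Puppet or another configuration tool, none of this is needed.
Here is a server information collector that suits my own use: [Shell] Collecting server information and exporting it to wiki and Excel, http://www.ohlinux.com/archives/824/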

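周健_MediaV, answered 2012-09-05

You should monitor the connection between every pair of modules. Whether you use TCP/UDP, Thrift or some other RPC server, record the QPS of success/fail/timeout and the latency on the client side, and record QPS on the server side.
Write a monitoring library; it is not complicated. Keep a global map, have a thread that counts once per second, and expose the numbers through a network interface so they can be integrated with Nagios, Zabbix and the like.
Another lesson: each service should register itself with an internal name service such as ZooKeeper, so upstream and downstream addresses do not have to be passed around as configuration.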
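The monitoring library described in this answer can be quite small. The sketch below is one possible shape for it, under stated assumptions: CallStats, record and flush are hypothetical names, a timer thread elsewhere is assumed to call flush() once per interval, and exposing the numbers over the network is left out.

```python
import threading
from collections import defaultdict


class CallStats:
    """Thread-safe counters keyed by (target, outcome)."""

    OUTCOMES = ("success", "fail", "timeout")

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)
        self._latency_ms = defaultdict(float)

    def record(self, target, outcome, latency_ms):
        """Record one client-side call to `target` with its outcome and latency."""
        if outcome not in self.OUTCOMES:
            raise ValueError("unknown outcome: %r" % outcome)
        with self._lock:
            self._counts[(target, outcome)] += 1
            self._latency_ms[(target, outcome)] += latency_ms

    def flush(self, interval_s=1.0):
        """Return {(target, outcome): (qps, avg_latency_ms)} and reset counters."""
        with self._lock:
            counts, self._counts = self._counts, defaultdict(int)
            latency, self._latency_ms = self._latency_ms, defaultdict(float)
        return {key: (n / interval_s, latency[key] / n)
                for key, n in counts.items()}


if __name__ == "__main__":
    stats = CallStats()
    stats.record("user-service", "success", 12.0)
    stats.record("user-service", "timeout", 500.0)
    print(stats.flush())
```

Swapping the dictionaries under the lock keeps the critical section short, so recording calls on hot paths is not blocked while the averages are computed.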

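tangsty_116203, answered 2012-09-05, updated 2012-09-05

With a modest number of servers, say fewer than 200, try Nagios. It makes monitoring CPU, memory, disk and other basics easy, and monitoring your own services is simple too: combine a few plugins and write some small scripts.

With many servers, more than 1000, collecting this information efficiently becomes complicated; just think how much data reaches your monitoring server every second. You then need to design the topology carefully and write a fair amount of code. Of course, existing open-source frameworks can handle the collection too.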

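蛙蛙王子, answered 2014-05-26

Broadly there are two kinds: system monitoring and business monitoring.

System monitoring covers CPU, memory and network bandwidth usage on each host, plus the core metrics of services such as MySQL, Redis and Nginx. This is basic, must-have monitoring; done well, it reveals many production problems before they bite.

Business monitoring covers business metrics, such as calls per second to an API, the API's average response time per minute, the number of online users, and even operational data such as seven-day retention, daily new users and daily churned users. These numbers matter too: they are the barometer of your business and the basis for important decisions.

For system monitoring there is plenty of open-source software, such as the well-known Nagios, Cacti and Zabbix. They are fairly complex to deploy: agents on the clients, a center to collect, store and display data, and many plugins to maintain. A simpler option is collectd, which ships with plugins for system CPU, disk utilization and common services such as MySQL, Nginx and Redis, and effectively suggests which metrics to monitor. Installation is easy, basically ./configure && make && make install.

For business monitoring you have to write code to report business data yourself. A popular solution today is statsd + graphite: it is lightweight, has SDKs for many languages, and makes it easy to track all kinds of metrics.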
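To give a feel for how light statsd reporting is, here is a minimal sketch that formats and sends metrics in the plain-text statsd line format (name:value|type) over UDP. The function names and the metric names are hypothetical; the host and port defaults assume a statsd daemon listening locally on its usual UDP port 8125. A real project would more likely use an existing client library.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric: 'c' counter, 'g' gauge, 'ms' timer."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError("unsupported metric type: %r" % metric_type)
    return "%s:%s|%s" % (name, value, metric_type)


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one line over UDP; fire-and-forget, no reply is expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical business metrics: one API call and its response time.
    send_statsd(statsd_line("api.login.calls", 1, "c"))
    send_statsd(statsd_line("api.login.latency", 35, "ms"))
```

Because UDP is connectionless, reporting does not block the business code even when no daemon is listening.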

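Most monitoring systems look alike:

An agent on every machine collects local performance and service data.
Business services on each machine submit their own data to the center through an SDK.
Each agent can load plugins on demand to monitor new metrics.
Usually each data center has one center that collects the metrics reported by agents and services.
The center stores, archives and compresses the collected metrics, usually in an RRD database.
The center needs a web interface for viewing the historical graph of each metric, and views and dashboards that show groups of related metrics.
The center sends a daily report of a few user-selected key metrics to operations or other relevant staff.
The center stores alert rules, for example alerting when a metric exceeds a threshold several times in a row, fluctuates beyond a given range, or has not reported for too long.
The center also consolidates alerts: merging alerts of the same kind, temporarily muting a class of alerts, and preventing alert storms caused by network jitter. Without this, operations staff drown in alerts.
The center delivers alerts to operations staff through channels such as SMS, email, WeChat and voice calls.
The center reviews, counts and analyzes every alert to work out each system's weak points, availability, uptime, stability and so on.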
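One of the alert rules in the list above, "a metric exceeds a threshold several times in a row", can be sketched in a few lines. The class name and parameters below are hypothetical and chosen for illustration; a production rule engine would also need the muting and merging described above.

```python
class ConsecutiveThresholdRule:
    """Fire once a metric has exceeded `threshold` for `times` samples in a row."""

    def __init__(self, threshold, times):
        self.threshold = threshold
        self.times = times
        self._streak = 0

    def observe(self, value):
        """Feed one sample; return True while the streak is at least `times`."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any normal sample resets the streak
        return self._streak >= self.times


if __name__ == "__main__":
    rule = ConsecutiveThresholdRule(threshold=90, times=3)
    for sample in [95, 96, 80, 91, 92, 93]:
        print(sample, rule.observe(sample))
```

Requiring a streak rather than a single breach is a simple way to suppress alerts caused by one-off spikes.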

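So building a complete, reliable monitoring system yourself is hard and takes a lot of people and effort to develop and maintain.

Abroad there are now vendors that offer hosted monitoring: the center is hosted by them, which removes much of the work. You still install the agents and plugins yourself, but that is easy with the many batch deployment tools available.

Well-known ones include NewRelic, StatHat and hostedgraphite. Basically you install an agent that reports to their center, or submit custom data through their SDK; they handle storage, display and alerting, which saves a lot of manpower.

In China some are doing similar things. DNSPod's D Monitor (D监控) recently launched custom monitoring that is compatible with graphite's reporting interface: deploy collectd yourself and all kinds of system metrics are covered, and for business monitoring graphite has SDKs for many languages. graphite itself is open source and has a rich ecosystem of tools, enough for many needs.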

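What management and operations tools do IT administrators commonly use?

For network monitoring, asset management and so on.

14 answers

陈湛翀, works in operations

For system monitoring, use Nagios. Beyond the bundled plugins, you must learn to write your own to meet custom monitoring needs. NRPE is just as important, as is graphing historical monitoring data. The main reason to use Nagios is its fault alerting mechanism.
For traffic monitoring, use Cacti; again you must learn to configure Cacti templates and write your own scripts.
For cluster monitoring, use Ganglia. It is distributed and has no strict single point. The two tools above have a strict monitor server and clients: the server signals the client to run a check and collect data. Ganglia does not have this weakness.
For ping monitoring, use ipmonitor, though it lags a little; I usually write my own ping monitor.

Beyond that, write your own tools. For reference: perl + rrdtool, with your own scripts for monitoring and graphing.

I am rather conservative and only use a few long-established monitoring tools.

Posted 2011-09-16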

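饶琛琳, DevOps / Perl / classical Chinese poetry

Adding to 陈湛翀's answer:
pnp4nagios, a graphing plugin for Nagios, works well; compared with learning to build Cacti templates, I find pnp cheap to learn.
On the monitor/client point I disagree slightly: Ganglia also requires gmond on every machine. The difference from NRPE and snmpd is whether monitoring is active or passive. In fact Nagios also has the NSCA mode.

Posted 2011-11-14

dccmx, tech person

Personal preferences:
Configuration management: puppet
Metric collection: collectd + statsd (statsite)
Metric graphing: graphite
Alerting: nagios
Version control: git

Posted 2012-12-09

伍卫民, a not-so-pure Internet operations engineer

For traffic monitoring use Cacti; you can customize many related graphs and alerts.
For asset management, try GLPI.

Posted 2011-05-10

胡志伟, operations, Linux

The three monitoring tools I currently find easiest to work with:
1. Nagios. Strengths: timely alerts and the web display.
Weaknesses: configuration is fairly tedious, hosts cannot be added through the web interface, and data is not archived.
2. Cacti. Strengths: uses SNMP as its main collection method, stores data in RRA files and displays it with rrdtool; adding hosts and analyzing performance trends are comparatively strong points.
Weaknesses: alerts are less timely.
3. Zabbix. Strengths: distributed server monitoring with unified monitoring across data centers; one proxy host per LAN collects data and forwards it to the monitoring server; supports SNMP, zabbix_agentd and other collection methods; custom data collection is simpler than with Cacti scripts; it has device registration and more. Very powerful and suited to large distributed architectures.
Weaknesses: relatively high resource usage; data is stored mainly in MySQL; graphs render slowly (possibly because of the data volume; not as smooth as Cacti).

In short, choosing the tool that fits your actual situation matters most.

Edited 2011-09-18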

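蓝天

For network monitoring plus asset management, the open-source Zabbix is the best fit. Its monitoring items are fairly complete and highly customizable, and monitoring configuration templates can be imported and exported. Once you get used to it you will not want to give it up.
For operations automation there are:
1. Automated system installation: kickstart, cobbler, etc.
2. Automated configuration and deployment: puppet, saltstack, ansible, etc.
3. Automated monitoring: nagios, ganglia, cacti, etc.
These are all common open-source operations tools. The SAs at companies like BAT build their own custom operations systems.
Typed on a phone, so I will stop here.

Posted 2014-04-02

张涤凡, pig-farming expert

Nagios is good. I am evaluating OSSIM.
For traffic management, Cacti of course.

Posted 2011-10-31

xia luke, cloud computing / IT management / marketing / SaaS / sleep enthusiast

ManageEngine: network monitoring, asset management, ITSM, APM, traffic analysis; it has everything and is fairly simple to use. It is not open source, but it should cost less than IBM/HP.

Posted 2014-04-03

刘生, a life of wealth

Everyone's lists are quite complete. Let me add that IT operations also needs an ITIL management tool, a platform for handling daily incident reports, change releases, knowledge base management and so on (E8.ITSM), as well as meeting operations KPI targets.

Posted 2014-06-10

杨宇, Dell, IT is all about tinkering

Zenoss. It does not seem to have many users in China; BesTV (百视通) is using it.

Posted 2011-12-05

宋明明, a realist on an idealist path

1. For monitoring, both Nagios and Zabbix work. Both can be distributed, but Zabbix is more widely used in distributed setups.
2. For traffic everyone seems to use Cacti, but building templates is a problem, so you may as well use rrdtool plus your own tools to draw graphs. My previous employer wrote its own tools in PHP and drew with rrdtool, and that worked fine.
3. Configuration tools: my current employer uses many, salt, puppet and so on. I prefer salt since it is closer to Python, and it is pleasant to use.
4. Version control: I prefer git, but at work some teams use svn and some git.
5. Wiki tools: at work we use DokuWiki, which is PHP-based, needs no database and is clean. Personally I like gollum, which also needs no database, supports Markdown rendered to HTML, and is clean and simple. There is another wiki written in Python that I am also interested in.
6. Operations language: at work mainly PHP, with some Python; personally I go the Python route.

Posted 2013-12-05

Sun Kairong, operations engineer

Currently using Zabbix, but it does not seem very convenient to use.

Posted 2014-04-02

nowo, to know what you know and what you do not know

Zabbix fully covers network monitoring and asset management needs.
Zabbix is under continuous development. Besides the usual monitoring items it supports custom ones, such as database monitoring and system connection counts. Practically every item you might need is fairly easy to implement in Zabbix, and its customization is very powerful.
I think it is currently the best monitoring software for small and medium-sized businesses, bar none, for these reasons:
1. Open source. Development is active and releases are frequent, so new features and bug fixes arrive quickly.
2. Simple. The front end is MySQL + PHP, the configuration UI is clear at a glance, and once familiar you can easily implement the monitoring you need.
3. Powerful. Many monitoring types, agents for many platforms, many built-in items, plus custom items, which together cover the common monitoring needs.

Edited 2014-06-05

何晓阳, CEO of 北京蓝海讯通科技有限公司

The asker frames an IT administrator's work as network monitoring and asset management, which is the usual practice in China. Abroad, an IT administrator's work centers on applications, business, DevOps and user experience, so the core of IT operations should be application performance management rather than anything else.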

http://article.yeeyan.org/compare/388225

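Five free network monitoring tools

Author: Jack Wallen; from "Five Apps"

Published: March 28, 2012

Among the many network and system monitoring tools, you will find free ones that rival expensive commercial monitors.

If you are a system or network administrator, you need monitoring tools. You have to know the state of your whole network at all times so you can optimize it and head off potential problems. Fortunately, many monitoring tools can help you keep track of the whole network. Some are very expensive and do help a great deal. But there are free tools that offer the same help, and in some cases even more. Yes, really: more.

Below are five system and/or network monitoring tools whose features go well beyond what you might expect. I am confident you will find one or more among them that meets your needs.

Note: this list is also available as a photo gallery.

Figure A

1. Observium (Figure A) is an auto-discovering network monitoring tool built on PHP/MySQL/SNMP. It mainly monitors Linux, UNIX, Cisco, Juniper, Brocade, Foundry, HP and other platforms. With Observium the graphs are clear and the interface is easy to use, and it can monitor a huge range of processes and systems. Its one drawback is the lack of alerting; to cover that gap you can pair Observium with a tool such as Nagios for up/down alerts.

Figure B

2. Ganglia (Figure B) is a scalable distributed monitoring system that monitors and displays the state of the nodes in a cluster. It gives you a quick, easy-to-read overview of the whole cluster. It has been ported to many platforms and is used on thousands of clusters around the world. Anyone running server clusters should use Ganglia. It can handle clusters of up to 2000 nodes.

Figure C

3. Spiceworks (Figure C) is becoming an industry-standard network/system monitoring tool. You have to put up with a few ads, but its features and web interface are hard to match. Spiceworks monitors (and auto-discovers) your whole network, alerts you when something fails, and provides clear graphs. It also lets you connect with other IT professionals through the Spiceworks community.

Figure D

4. Nagios (Figure D) is considered by many to be the king of open-source network monitoring. It is not the easiest to set up and configure (you must edit configuration files by hand), but Nagios is remarkably powerful. The idea of manual configuration may put some people off, but it is what makes Nagios the most flexible network monitor. In the end, the sheer number of features Nagios offers is unmatched. You can even set up alerts by email, SMS and printed paper!

Figure E

5. Zabbix (Figure E) is as powerful as any other monitoring tool and offers user-defined views, zooming and mapping in its web interface. Zabbix supports agentless monitoring, can monitor just about any data you want, produces SLA reports, and can monitor more than 10,000 devices. You can even get commercial support for this open-source tool. A distinctive feature is audible alerts: when something goes wrong, Zabbix can play a sound.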

http://www.langyuweb.com/a/wangluozhishi/2013/0111/29762.html

Three powerful tools for cloud computing platform management: Nagios, Ganglia and Splunk

http://www.51studyit.com/html/notes/20140627/977.html

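1. Install rrdtool

1.1. Install dependencies

- Copy the CentOS packages

Copy the rpm packages from the CentOS installation disc to /root/ for later use.

- Install the zlib development package

rpm -ivh /root/CentOS/zlib-devel-1.2.3-3.x86_64.rpm

- Install the freetype development package:

rpm -ivh /root/CentOS/freetype-devel-2.2.1-21.el5_3.x86_64.rpm

- Install the libart development package:

rpm -ivh /root/CentOS/libart_lgpl-devel-2.3.17-4.x86_64.rpm

- Install the libpng development package:

rpm -ivh /root/CentOS/libpng-devel-1.2.10-7.1.el5_3.2.x86_64.rpm

1.2. Install rrdtool

- Prepare the rrdtool source package

rrdtool-1.2.27.tar

tar xvf rrdtool-1.2.27.tar

cd rrdtool-1.2.27

- Build and install

./configure --prefix=/usr/local/rrdtool

make

make install

- Verify the installation

Run /usr/local/rrdtool/bin/rrdtool

If output like the following appears, the installation succeeded:

Compiled May 24 2011 11:46:06

...

- Possible problems

If pkg-config cannot find a library, do not fixate on pkg-config itself; some library is not installed properly.

When you run configure for rrdtool it reports which packages are missing; remember to install the source packages of those libraries.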

2. Install ganglia

2.1. Install dependencies

ganglia depends on the following software:

expat-2.0.1.tar.gz, apr-1.3.2.tar.bz2, apr-util-1.3.2.tar.bz2, confuse-2.6.tar.gz

- Install expat-2.0.1.tar.gz

tar zxvf expat-2.0.1.tar.gz

./configure --prefix=/usr/local/expat

make

make install

mkdir /usr/local/expat/lib64

cp -a /usr/local/expat/lib/* /usr/local/expat/lib64/

On a 32-bit operating system the two steps above are not needed.

- Install apr-1.3.2.tar.bz2

tar xvjf apr-1.3.2.tar.bz2

./configure --prefix=/usr/local/apr

make

make install

- Install apr-util-1.3.2.tar.bz2

tar xvjf apr-util-1.3.2.tar.bz2

./configure --with-apr=/usr/local/apr --with-expat=/usr/local/expat

make

make install

cp /usr/local/apr/include/apr-1/* /usr/local/apr/include/

This copy is needed because the ganglia build looks for the apr files under /usr/local/apr/include by default.

mkdir /usr/local/apr/lib64

cp -a /usr/local/apr/lib/* /usr/local/apr/lib64/

On a 32-bit operating system the two steps above are not needed.

- Install confuse-2.6.tar.gz

tar zxvf confuse-2.6.tar.gz

./configure CFLAGS=-fPIC --disable-nls --prefix=/usr/local/confuse

make

make install

mkdir /usr/local/confuse/lib64

cp -a /usr/local/confuse/lib/* /usr/local/confuse/lib64/

On a 32-bit operating system the two steps above are not needed.

2.2. Install the ganglia management side (gmetad)

(1) Unpack

tar zxvf ganglia-3.1.7.tar.gz

(2) Configure

./configure --prefix=/usr/local/ganglia --with-librrd=/usr/local/rrdtool --with-libapr=/usr/local/apr --with-libexpat=/usr/local/expat --with-libconfuse=/usr/local/confuse --with-gmetad --enable-gexec --enable-status --sysconfdir=/etc/ganglia

(3) Build and install

make

make install

(4) Create the data directory for rrdtool

mkdir -p /var/lib/ganglia/rrds

chown -R nobody:nobody /var/lib/ganglia/rrds

(5) Add gmetad as a system service

cp {ganglia source directory}/gmetad/gmetad.init /etc/init.d/gmetad

(6) Put the gmetad command in /usr/sbin/

cp /usr/local/ganglia/sbin/gmetad /usr/sbin/

(7) Register gmetad with chkconfig

chkconfig --add gmetad

(8) Start the gmetad service

service gmetad start

Starting GANGLIA gmetad: [ OK ]

Seeing [ OK ] means it started successfully.

2.3. Install the ganglia node (gmond)

(1) Unpack, as for gmetad

(2) Configure: the same as for gmetad, but without the "--with-gmetad" option

./configure --prefix=/usr/local/ganglia --with-librrd=/usr/local/rrdtool --with-libapr=/usr/local/apr --with-libexpat=/usr/local/expat --with-libconfuse=/usr/local/confuse --enable-gexec --enable-status --sysconfdir=/etc/ganglia

(3) Build and install

make

make install

(4) Add gmond as a system service

cp {ganglia source directory}/gmond/gmond.init /etc/init.d/gmond

(5) Put the gmond command in /usr/sbin/

cp /usr/local/ganglia/sbin/gmond /usr/sbin/

(6) Generate gmond's default configuration file

gmond --default_config > /etc/ganglia/gmond.conf

(7) Register gmond with chkconfig

chkconfig --add gmond

(8) Edit the gmond configuration file /etc/ganglia/gmond.conf

cluster {

name = "test cluster"

owner = "nobody"

latlong = "unspecified"

url = "unspecified"

}

(9) Start the gmond service

service gmond start

Starting GANGLIA gmond: [ OK ]

Seeing [ OK ] means it started successfully.

2.4. Install the ganglia web pages

2.4.1. Install PHP

(1) Install php-common

rpm -ivh /root/CentOS/php-common-5.1.6-27.el5.x86_64.rpm

(2) Install php-cli

rpm -ivh /root/CentOS/php-cli-5.1.6-27.el5.x86_64.rpm

(3) Install php

rpm -ivh /root/CentOS/php-5.1.6-27.el5.x86_64.rpm

(4) Install php-gd

rpm -ivh /root/CentOS/php-gd-5.1.6-27.el5.x86_64.rpm

2.4.2. Set up the web server

(1) Create the web directory for ganglia

mkdir /var/www/html/ganglia

(2) Copy the files from ganglia's web directory to the httpd server

cp -a {ganglia source directory}/web/* /var/www/html/ganglia/

(3) Disable SELinux

setenforce 0 (otherwise you get the error: Forbidden, You don't have permission to access /ganglia/ on this server)

(4) Edit conf.php to tell ganglia where the rrdtool command is

define("RRDTOOL", "/usr/local/rrdtool/bin/rrdtool");

(5) Restart the httpd server

service httpd restart

(6) View the web page

http://<web server ip>/ganglia/

3. Cluster deployment

Cluster deployment is driven by the configuration files:

- /etc/ganglia/gmetad.conf

Here you configure the cluster name and the server address

data_source "my cluster" localhost

- /etc/ganglia/gmond.conf

Here the name and owner attributes under cluster must match the gmetad side

cluster {

name = "my cluster"

owner = "nobody"

latlong = "unspecified"

url = "unspecified"

}

- If you change owner, also change the owner of the rrdtool data directory:

chown -R nobody:nobody /var/lib/ganglia/rrds
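A mismatch between the two files is easy to miss by eye. The sketch below, written under the assumption that both files use the simple forms shown above, extracts the data_source names from gmetad.conf text and the cluster name from gmond.conf text and compares them, following the guide's advice that they should agree. The function names are hypothetical and the regular expressions are deliberately naive; they are not a full parser for either file format.

```python
import re


def gmetad_sources(gmetad_conf_text):
    """Return the cluster names declared by data_source lines."""
    return re.findall(r'^\s*data_source\s+"([^"]+)"',
                      gmetad_conf_text, flags=re.M)


def gmond_cluster_name(gmond_conf_text):
    """Return the name set inside the cluster { ... } block, or None."""
    block = re.search(r'cluster\s*\{(.*?)\}', gmond_conf_text, flags=re.S)
    if not block:
        return None
    name = re.search(r'name\s*=\s*"([^"]*)"', block.group(1))
    return name.group(1) if name else None


def names_match(gmetad_conf_text, gmond_conf_text):
    """True if gmond's cluster name is one of gmetad's data sources."""
    return gmond_cluster_name(gmond_conf_text) in gmetad_sources(gmetad_conf_text)


if __name__ == "__main__":
    with open("/etc/ganglia/gmetad.conf") as f1, open("/etc/ganglia/gmond.conf") as f2:
        print(names_match(f1.read(), f2.read()))
```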

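4. Problems encountered during installation

- pkg-config errors while installing rrdtool

Usually some libraries are missing; remember to install the dev packages.

- confuse, expat, apr and other libraries not found while installing ganglia

On a 64-bit machine ganglia looks for files in the lib64 directory of each dependency, so the files in lib need to be copied to lib64.

- gmetad fails to start:

Check gmetad's status: service gmetad status

gmetad dead but subsys locked

The user was changed in gmetad.conf:

setuid_username "root" (the username here must match the owner of the rrd data directory)

- gmond fails to start:

Check gmond's status: service gmond status

gmond dead but subsys locked

Possibly no gateway is configured, or the cluster owner setting is wrong.

- The web page reports an error:

Ganglia cannot find a data source. Is gmond running?

There was an error collecting ganglia data XML error : Invalid document

The symptom is a page that shows only one line of error text.

The fix is to set $ganglia_ip = "127.0.0.1"; in conf.php. The machine's own IP cannot be used here; the reason is still unclear.

- Graphs are not displayed:

Usually php-gd is not installed, or the rrdtool setting in /var/www/html/ganglia/conf.php is wrong.

- The web page reports Forbidden, You don't have permission to access /ganglia/ on this server

SELinux has not been disabled.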

http://tech.sosidc.com/home/index/linux/id/55160.html

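Today I was chatting with a friend about Ganglia. He wanted to do performance analysis and first chose Zabbix, then switched to Ganglia. My answer was that Ganglia is better suited to this. Many people do not know Ganglia or have never used it, so here is a brief introduction.

Compared with tools such as Cacti, Nagios and Zabbix, Ganglia is more about collecting metrics and tracking them over time, and it can be used for performance monitoring, analysis and tuning of clusters.

Ganglia is an open-source monitoring project started at UC Berkeley and designed to measure thousands of nodes. It mainly monitors cluster performance metrics such as CPU, memory, disk utilization, I/O load and network traffic, and can also monitor custom metrics. The curves Ganglia draws make each node's state easy to see, which helps in tuning and allocating system resources and improving overall performance.

In addition, the Ganglia client gmond adds very little system load, so it can run on every machine in a cluster without affecting user performance.

That was all rather official-sounding.
My own take: let Ganglia collect performance data so you can analyze service performance better.

Ganglia's architecture:

Each monitored node or cluster runs a gmond process that collects, aggregates and sends monitoring data. gmond can act both as a sender (collecting local data) and as a receiver (aggregating data from multiple nodes).
Usually there is only one gmetad process in the whole monitoring system. It polls all the gmond instances periodically, actively pulls their data and stores it in the RRD storage engine.
ganglia-web is a web interface written in PHP that displays the data stored in RRD as graphs. It usually runs alongside gmetad.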
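The data gmetad pulls from gmond is an XML document of clusters, hosts and metrics. The sketch below shows how such a document can be turned into a per-host dictionary with the Python standard library. SAMPLE_XML is a hand-written, heavily simplified stand-in for real gmond output (which carries many more attributes), and metrics_by_host is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; real gmond output has more attributes.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="my cluster" OWNER="nobody">
    <HOST NAME="node1" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=" "/>
      <METRIC NAME="cpu_num" VAL="8" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Return {host_name: {metric_name: value_string}} from gmond-style XML."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result


if __name__ == "__main__":
    print(metrics_by_host(SAMPLE_XML))
```

Values are kept as strings here; converting them according to the TYPE attribute is left to the caller.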

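For Ganglia's distributed design, see the figure below.

Before installing, a warning: you will hit all sorts of problems installing Ganglia. When you do, search for them, or ask me.
If you do not have the patience, try it on Ubuntu first.

Installing Ganglia on Ubuntu is very simple. First install the three packages below. A web server is needed to view the graphs, so Apache is installed automatically if it is not already present.

sudo apt-get install ganglia-monitor gmetad ganglia-webfrontend

After installation the gmetad and gmond services are running; the commands below start them if needed.

sudo service gmetad start
sudo service ganglia-monitor start
sudo mv /usr/share/ganglia-webfrontend/ /var/www/ganglia

Then just open it in a browser.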

Below is the installation process on CentOS.

# basic development packages

yum -y install ntp make openssl openssl-devel pcre pcre-devel libpng
libpng-devel libjpeg-6b libjpeg-devel-6b freetype freetype-devel gd gd-devel zlib zlib-devel
gcc gcc-c++ libXpm libXpm-devel ncurses ncurses-devel libmcrypt libmcrypt-devel libxml2
libxml2-devel imake autoconf automake screen sysstat compat-libstdc++-33 curl curl-devel

# install the LAMP environment:

yum -y install httpd mysql mysql-server mysql-connector-odbc php php-mysql php-common php-pdo apr apr-util* pcre pcre-devel

wget ftp://ftp.univie.ac.at/systems/linux/dag/RedHat/el5/en/x86_64/dag/RPMS/libconfuse-2.6-2.el5.rf.x86_64.rpm
wget ftp://ftp.univie.ac.at/systems/linux/dag/redhat/el5/en/x86_64/dag/RPMS/libconfuse-devel-2.6-2.el5.rf.x86_64.rpm
rpm -ivh libconfuse*.rpm

# install the graphing tool: rrdtool

cd /root/tools
wget http://oss.oetiker.ch/rrdtool/pub/rrdtool-1.4.5.tar.gz
tar zvxf rrdtool-1.4.5.tar.gz
cd rrdtool-1.4.5
./configure --prefix=/usr/local/rrdtool
make && make install
cd ..

# I recommend simply using

yum -y install rrdtool
ln -s /usr/local/rrdtool/include/rrd.h /usr/include/rrd.h
ln -s /usr/local/rrdtool/lib/librrd.a /usr/lib/librrd.a

# You can also install ganglia with yum, which is much simpler. The ganglia in the EPEL repository is version 3.1; if you want the latest, install from source.
# install ganglia

wget http://cdnetworks-kr-2.dl.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.2.0/ganglia-3.2.0.tar.gz
tar zxvf ganglia-3.2.0.tar.gz
cd ganglia-3.2.0
./configure --prefix=/usr/local/ganglia --sysconfdir=/usr/local/ganglia --with-gmetad
make && make install
cd ..

Installing the client is simpler:

./configure
make
make install

# That is all. If you get errors, search for the cause.

# web directory

cd /root/tools/ganglia-3.2.0
cp -Rf web /var/www/html/ganglia
chown -R apache.apache /var/www/html/ganglia

# fix the RRDTOOL path:
vi /var/www/html/ganglia/conf.php
Change line 33 or so to the following (run which rrdtool to find where rrdtool is installed):
define("RRDTOOL", "/usr/local/rrdtool/bin/rrdtool");

# create the init scripts

cp gmetad/gmetad.init /etc/rc.d/init.d/gmetad
cp gmond/gmond.init /etc/rc.d/init.d/gmond

# start at boot
chkconfig gmetad on
chkconfig gmond on

# configuration file for the monitored side
gmond/gmond -t | tee /usr/local/ganglia/gmond.conf

# configuration file for the server side
cp gmetad/gmetad.conf /usr/local/ganglia/

# directories for rrdtool data

mkdir -p /var/lib/ganglia/rrds
mkdir -p /var/lib/ganglia/dwoo/
chown apache:apache /var/lib/ganglia/dwoo
chown apache:apache /var/lib/ganglia/rrds

posted @ 2015-02-04 17:12 陳聽溪
