Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application.
    -- Erlang/OTP Doc

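     The basic idea of a supervisor is to isolate and manage errors by building a hierarchy; concretely, it keeps its child processes alive by restarting them. If a supervisor is part of a process tree, it is terminated automatically by its own supervisor: when told to shut down, it terminates all of its children in the reverse of their start order and finally terminates itself. The purpose of a restart is to bring the system back to a stable state; once it is back in a stable state, a retry is possible if the fault shows up again. If initialization itself is unstable, the monitor-and-restart strategy that follows is of little use. In other words, the initialization phase of an application needs reliability guarantees: it may read configuration files or load recovery data from a database, and even if that takes a while it should run synchronously to completion. If the application depends on a non-local database or an external service, a faster asynchronous start is acceptable, because such services fail often enough during normal operation that starting a little earlier or later makes no difference.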
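
The difference between the two start-up styles can be sketched with a small gen_server. This is only an illustration, not log4erl code: the module name init_demo, the sync/async switch and the two placeholder functions are assumptions, and handle_continue/2 requires OTP 21 or later.

```erlang
-module(init_demo).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

%% Mode is sync | async (a hypothetical switch for this sketch).
start_link(Mode) ->
    gen_server:start_link(?MODULE, Mode, []).

%% Synchronous: start_link/1 does not return until the local state is
%% loaded, so the supervisor only continues once this child is usable.
init(sync) ->
    {ok, #{config => load_local_config()}};
%% Asynchronous: return at once and connect afterwards, so a slow
%% remote service cannot hold up the start of the whole tree.
init(async) ->
    {ok, #{conn => undefined}, {continue, connect}}.

%% Runs right after init/1, before any other message is handled.
handle_continue(connect, State) ->
    {noreply, State#{conn => connect_remote()}}.

handle_call(get_state, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Placeholders standing in for real work.
load_local_config() -> [{level, debug}].
connect_remote() -> fake_connection.
```

In the async case a real implementation would also have to handle requests that arrive while the connection is still missing; the sketch leaves that out.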

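   In [Erlang 0025] Understanding Erlang/OTP - Application we used the log4erl project to study Erlang/OTP applications, and noted that the application's start function starts log4erl's top-level supervision tree. Today we pick up from there: we look at how log4erl's supervision tree is built, and run experiments to see how a supervisor restores service by restarting. After starting it with application:start(log4erl)., the process tree looks like this: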

Below is the start_link function from log4erl_sup. supervisor:start_link runs synchronously and returns only after all child processes have started. supervisor:start_link invokes the callback function init/1.

start_link(Default_logger) ->
    R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
    %log4erl:start_link(Default_logger),
    add_logger(Default_logger),
    ?LOG2("Result in supervisor is ~p~n",[R]),
    R.

%% The init/1 callback
init([]) ->
    {ok,
     {
      {one_for_one,3,10},
      []
     }
    }.
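  The initialization of log4erl's top-level supervision tree is very simple: it only defines the restart strategy (RestartStrategy) and the maximum restart frequency: {one_for_one,3,10}.
 {one_for_one,3,10} has the form {How, Max, Within}: how to restart (How, the restart strategy), and how many restarts (Max) are tolerated within how many seconds (Within). The maximum restart frequency exists to keep repeated restarts from turning into an endless loop. Once the threshold is exceeded, the supervisor terminates all of its children and then itself, and the exit propagates up the process tree, where the next supervisor up takes appropriate action: it either restarts the terminated supervisor or terminates as well. Choosing these values can be awkward, and most material will tell you that "it depends entirely on your application". There are rules of thumb, though: a common value in production is 4 restarts within one hour (Max = 4, Within = 3600), and you can also look at how open-source projects similar to your application configure theirs. A Max of 0, as in {one_for_one,0,1}, means no restarts are allowed; the examples below show that the Yaws project takes this approach.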
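
To watch the threshold in action, here is a minimal sketch that is not taken from log4erl: a supervisor using the same {one_for_one,3,10} setting with a single permanent child, which the caller keeps killing. The module name intensity_demo, the trivial worker and the 200 ms wait are all assumptions made for the illustration.

```erlang
-module(intensity_demo).
-behaviour(supervisor).
-export([run/0, init/1, start_worker/0]).

%% A trivial worker that just waits until it is killed.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

init([]) ->
    Child = {worker, {?MODULE, start_worker, []},
             permanent, 1000, worker, [?MODULE]},
    %% At most 3 restarts within 10 seconds.
    {ok, {{one_for_one, 3, 10}, [Child]}}.

%% Kill the child repeatedly until the supervisor gives up.
%% Returns {NumberOfKills, SupervisorExitReason}.
run() ->
    process_flag(trap_exit, true),      % survive the supervisor's exit
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    kill_until_down(Sup, 0).

kill_until_down(Sup, Kills) ->
    [{worker, Pid, worker, _}] = supervisor:which_children(Sup),
    exit(Pid, kill),
    receive
        {'EXIT', Sup, Reason} ->
            {Kills + 1, Reason}
    after 200 ->                        % crude wait for the restart
        kill_until_down(Sup, Kills + 1)
    end.
```

The first three kills are each answered with a restart. The fourth one exceeds Max within the 10 second window, so the supervisor shuts itself down and intensity_demo:run() should return {4, shutdown}.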
 
Here are the init functions of the top-level supervisors in a few open-source projects:
%% rabbit_sup.erl, from the well-known RabbitMQ
init([]) ->
    {ok, {{one_for_all, 0, 1}, []}}.


%% yaws_sup.erl, from Yaws - Yet Another Web Server
init([]) ->

    ChildSpecs = child_specs(),
    %% 0, 1 means that we never want supervisor restarts
    {ok,{{one_for_all, 0, 1}, ChildSpecs}}.


%% ejabberd_sup, from the ejabberd project
init([]) ->
    Hooks =
        {ejabberd_hooks,
         {ejabberd_hooks, start_link, []},
    %%......................... code omitted
    {ok, {{one_for_one, 10, 1},
          [Hooks,
           GlobalRouter,
           Cluster,
           ..................
           Listener]}}.
 
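Restart strategies
 
 one_for_one : children are treated as independent of each other; when one process fails, the others are not affected. Only the child that died is restarted.
 one_for_all : if one child terminates, all other children are terminated as well, and then all of them are restarted.
 rest_for_one : if one child terminates, the children started after it are terminated too. The child that died and the children taken down with it are then restarted.
 simple_one_for_one : a simplified one_for_one in which every child is a dynamically added instance of the same kind of process.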
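
The first three strategies can be compared with a small experiment. The sketch below is an assumption-laden illustration (module name strategy_demo, three dummy children a, b and c started in that order, a fixed 200 ms wait): it kills child b and reports which children ended up with a new pid.

```erlang
-module(strategy_demo).
-behaviour(supervisor).
-export([run/1, init/1, start_worker/0]).

%% A trivial worker that just waits until it is terminated.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Three identical children, started in the order a, b, c.
init(Strategy) ->
    Spec = fun(Id) ->
                   {Id, {?MODULE, start_worker, []},
                    permanent, 1000, worker, [?MODULE]}
           end,
    {ok, {{Strategy, 3, 10}, [Spec(a), Spec(b), Spec(c)]}}.

%% Kill child b and return the ids of all children that were restarted.
run(Strategy) ->
    {ok, Sup} = supervisor:start_link(?MODULE, Strategy),
    Before = pids(Sup),
    exit(maps:get(b, Before), kill),
    timer:sleep(200),                   % crude wait for the restarts
    After = pids(Sup),
    unlink(Sup),
    exit(Sup, shutdown),                % the parent asks the supervisor to stop
    [Id || Id <- [a, b, c], maps:get(Id, Before) =/= maps:get(Id, After)].

pids(Sup) ->
    maps:from_list([{Id, Pid} ||
                       {Id, Pid, _Type, _Mods} <- supervisor:which_children(Sup)]).
```

strategy_demo:run(one_for_one) should return [b], run(rest_for_one) should return [b,c], and run(one_for_all) should return [a,b,c]. simple_one_for_one does not fit this experiment, since it takes exactly one child specification and starts no children in init/1.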

 one_for_one maintains a list of its children ordered by start order, whereas simple_one_for_one, whose children are all the same (same MFA), uses a dict to hold its child information:
Note: one of the big differences between one_for_one and simple_one_for_one is that one_for_one holds a list of all the children it has (and had, if you don't clear it), started in order, while simple_one_for_one holds a single definition for all its children and works using a dict to hold its data. Basically, when a process crashes, the simple_one_for_one supervisor will be much faster when you have a large number of children.
 
Note: it is important to note that simple_one_for_one children are not respecting this rule with the Shutdown time. In the case of simple_one_for_one, the supervisor will just exit and it will be left to each of the workers to terminate on their own, after their supervisor is gone.
 
For the most part, writing a simple_one_for_one supervisor is similar to writing any other type of supervisor, except for one thing. The argument list in the {M,F,A} tuple is not the whole thing; the arguments you pass when you do supervisor:start_child(Sup, Args) are appended to it. That's right, supervisor:start_child/2 changes API. So instead of doing supervisor:start_child(Sup, Spec), which would call erlang:apply(M,F,A), we now have supervisor:start_child(Sup, Args), which calls erlang:apply(M,F,A++Args).
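
A minimal sketch of that argument handling, with hypothetical names (sofo_demo, start_worker/3, and the atoms from_spec, extra1 and extra2 are made up for the example): the child specification carries one argument, and supervisor:start_child/2 supplies two more.

```erlang
-module(sofo_demo).
-behaviour(supervisor).
-export([run/0, init/1, start_worker/3]).

%% Called as apply(sofo_demo, start_worker, [from_spec] ++ [Extra1, Extra2]).
start_worker(FromSpec, Extra1, Extra2) ->
    Args = [FromSpec, Extra1, Extra2],
    {ok, spawn_link(fun() -> loop(Args) end)}.

%% The worker only remembers the arguments it was started with.
loop(Args) ->
    receive
        {args, From} ->
            From ! {args, Args},
            loop(Args)
    end.

init([]) ->
    %% A single child specification; no child is started at this point.
    Child = {worker, {?MODULE, start_worker, [from_spec]},
             temporary, 1000, worker, [?MODULE]},
    {ok, {{simple_one_for_one, 3, 10}, [Child]}}.

%% Start one child dynamically and ask it which arguments it received.
run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    {ok, Pid} = supervisor:start_child(Sup, [extra1, extra2]),
    Pid ! {args, self()},
    receive
        {args, Args} -> Args
    after 1000 ->
        timeout
    end.
```

sofo_demo:run() returns [from_spec,extra1,extra2]: the arguments from the specification come first, followed by the ones given to start_child/2.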

After starting the top-level supervisor, start_link in log4erl_sup.erl adds a default logger:
 add_logger(Default_logger),
add_logger(Name) when is_atom(Name) ->
    N = atom_to_list(Name),
    add_logger(N);
add_logger(Name) when is_list(Name) ->
    C1 = {Name,
          {log_manager, start_link ,[Name]},
          permanent,
          10000,
          worker,
          [log_manager]},

    ?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
    supervisor:start_child(?MODULE, C1).
The logger that is added becomes a child of log4erl_sup. How a child is started and supervised is given by its child specification.
 C1 =  {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
 
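The six fields of C1 are {ID, Start, Restart, Shutdown, Type, Modules}:
ID : used internally by the supervisor to tell specifications apart, so it only has to be unique among the child specifications of that supervisor.
Start : the start arguments {M,F,A}
Restart : whether the process is restarted after it fails
                permanent: the process is restarted whenever it terminates, whatever the reason
                temporary: the process is never restarted
                transient: the process is restarted only if it terminates abnormally
Shutdown : how the process is shut down. An integer value such as 2000 means the process gets 2000 milliseconds to clean up and terminate by itself before it is killed by force (C1 above uses 10000).
              What actually happens is that the supervisor sends the child exit(Pid,shutdown) and waits for the exit signal to come back; if it does not arrive in time, the child is killed with exit(Child,kill)
             brutal_kill is also allowed here, meaning the process is killed immediately
             infinity : use infinity when the child is itself a supervisor, so that it has enough time to shut down its own subtree.
Type : there are only two values, supervisor and worker; any process that does not implement the supervisor behaviour is a worker.
                    Nesting supervisors gives finer-grained control over processes. The main purpose of this value is to tell the supervising process whether its child is a supervisor or a worker
Modules : the modules the process depends on. This information is only used by the release handler during hot code upgrades, to work out which processes use a given module; usually only the main module of the process needs to be listed. If the child is a supervisor, gen_server or gen_fsm, Modules is a one-element list containing the name of the callback module; if the child is a gen_event, Modules is dynamic. Yu Feng has written a dedicated analysis of the dynamic value: "Analysis of the dynamic behavior in Erlang supervisor specifications" http://blog.yufeng.info/archives/1455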
 
Modules is a list of one element, the name of the callback module used by the child behavior. The exception to that is when you have callback modules whose identity you do not know beforehand (such as event handlers in an event manager). In this case, the value of Modules should be dynamic so that the whole OTP system knows who to contact when using more advanced features, such as releases.
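
For comparison, releases from OTP 18 onward also accept a child specification written as a map with the same six pieces of information. The sketch below shows C1 in both forms and checks them with supervisor:check_childspecs/1, which only validates the syntax and does not start anything. The module name childspec_demo and its function names are assumptions for this illustration; log4erl itself uses the tuple form.

```erlang
-module(childspec_demo).
-export([tuple_spec/1, map_spec/1, check/1]).

%% The six-element tuple form, as used by log4erl_sup:add_logger/1.
tuple_spec(Name) ->
    {Name, {log_manager, start_link, [Name]},
     permanent, 10000, worker, [log_manager]}.

%% The same specification written as a map (OTP 18 and later).
map_spec(Name) ->
    #{id       => Name,
      start    => {log_manager, start_link, [Name]},
      restart  => permanent,
      shutdown => 10000,
      type     => worker,
      modules  => [log_manager]}.

%% Syntax check only: log_manager does not have to be loaded for this.
check(Name) ->
    {supervisor:check_childspecs([tuple_spec(Name)]),
     supervisor:check_childspecs([map_spec(Name)])}.
```

childspec_demo:check("default_logger") returns {ok,ok} when both forms are well formed.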
 
 
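In real use, several loggers are added to log4erl according to business needs, and we do not call application:start(log4erl). directly; instead we call log4erl:conf("log4erl.conf"). This function is a thin wrapper around the internal logic and actually calls log4erl_conf:conf(File). Here we define a simple log4erl.conf file, start with log4erl:conf("log4erl.conf")., and then look at what the process tree looks like:
%% log4erl.conf; I have reflowed the content slightly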
%%mod
logger default_logger{
     file_appender default_app{
    dir = "./log", level = debug, file = default_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}

%%mail mod
logger mail_logger{
     file_appender mail_app{
     dir = "./log", level = debug, file = mail_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}
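The corresponding process tree looks like this; the red lines between processes indicate links:

Let us follow the call chain through the code step by step: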

%==== File : log4erl_conf =======

%%log4erl_conf:conf(File).
conf(File) ->
    application:start(log4erl), %% start log4erl
    Tree = parse(leex(File)),   %% parse the configuration file
    traverse(Tree).             %% walk the configuration entries and build the supervision tree

%% Following the traversal: element/1 is called for every configuration entry
traverse([]) ->
    ok;
traverse([H|Tree]) ->
    element(H),
    traverse(Tree).

%% Our user-defined loggers take the {logger, Logger, Appenders} clause
element({cutoff_level, CutoffLevel}) ->
    log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
    appenders(Appenders);
element({logger, Logger, Appenders}) ->
    log4erl:add_logger(Logger),
    appenders(Logger, Appenders).

%==== File : log4erl =======
%% Following on, we arrive at log4erl:add_logger/1
add_logger(Logger) ->
    try_msg({add_logger, Logger}).

%% try_msg is a generic wrapper that adds exception handling
try_msg(Msg) ->
    try
        handle_call(Msg)
    catch
        exit:{noproc, _M} ->
            io:format("log4erl has not been initialized yet. To do so, please run~n"),
            io:format("> application:start(log4erl).~n"),
            {error, log4erl_not_started};
        E:M ->
            ?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
            {E, M}
    end.

%% Fragment of handle_call
handle_call({add_logger, Logger}) ->
    log_manager:add_logger(Logger);

%==== File : log_manager =======
%% Control passes to log_manager:add_logger(Logger),
%% which ends up calling log4erl_sup:add_logger(Logger); we analysed that above
add_logger(Logger) ->
    log4erl_sup:add_logger(Logger).

%% Back in log4erl_conf: after adding the logger, element/1 adds its appenders
appenders([]) ->
    ok;
appenders([H|Apps]) ->
    appender(H),
    appenders(Apps).

appenders(_, []) ->
    ok;
appenders(Logger, [H|Apps]) ->
    appender(Logger, H),
    appenders(Logger, Apps).

appender({appender, App, Name, Conf}) ->
    log4erl:add_appender({App, Name}, {conf, Conf}).

appender(Logger, {appender, App, Name, Conf}) ->
    log4erl:add_appender(Logger, {App, Name}, {conf, Conf}).


%==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
    try_msg({add_appender, Logger, Appender, Conf}).

handle_call({add_appender, Logger, Appender, Conf}) ->
    log_manager:add_appender(Logger, Appender, Conf);

%==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
    ?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
    log4erl_sup:add_guard(Logger, Appender, Name, Conf).

%==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
    C = {Name,
         {logger_guard, start_link ,[Logger, Appender, Name, Conf]},
         permanent,
         10000,
         worker,
         [logger_guard]},
    ?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
    supervisor:start_child(?MODULE, C).

%==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
    %?LOG2("starting guard for logger ~p~n",[Logger]),
    {ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
    case add_sup_handler(Pid, Logger, Conf) of
        {error, E} ->
            gen_server:call(Pid, stop),
            {error, E};
        _R ->
            {ok, Pid}
    end.

add_sup_handler(G_pid, Logger, Conf) ->
    ?LOG("add_sup()~n"),
    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).

handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
    ?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
    try
        Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
        {reply, Res, State}
    catch
        E:R ->
            {reply, {error, {E,R}}, State}
    end;

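gen_event:add_sup_handler installs the event handler and also links the event manager (the logger) to the calling process (here the logger_guard). To see the effect, we change the code and comment this part out, then look at the supervision tree again:

add_sup_handler(G_pid, Logger, Conf) ->

    %    ?LOG("add_sup()~n"),
    %    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
    ok.

With that commented out, the link between the logger and its guard is gone.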
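
The same effect can be reproduced outside log4erl. The following sketch uses a hypothetical module, sup_handler_demo, that serves as its own do-nothing gen_event handler; it compares the caller's links after gen_event:add_handler/3 and after gen_event:add_sup_handler/3.

```erlang
-module(sup_handler_demo).
-behaviour(gen_event).
-export([run/0]).
-export([init/1, handle_event/2, handle_call/2]).

%% A do-nothing event handler.
init(_Args) -> {ok, []}.
handle_event(_Event, State) -> {ok, State}.
handle_call(_Request, State) -> {ok, ok, State}.

%% Returns whether the caller is linked to the event manager
%% {before adding handlers, after add_handler, after add_sup_handler}.
run() ->
    {ok, Mgr} = gen_event:start(),      % start/0 does not link to the caller
    L0 = linked(Mgr),
    ok = gen_event:add_handler(Mgr, {?MODULE, plain}, []),
    L1 = linked(Mgr),
    ok = gen_event:add_sup_handler(Mgr, {?MODULE, supervised}, []),
    L2 = linked(Mgr),
    gen_event:stop(Mgr),
    {L0, L1, L2}.

linked(Mgr) ->
    {links, Links} = process_info(self(), links),
    lists:member(Mgr, Links).
```

sup_handler_demo:run() returns {false,false,true}: only add_sup_handler/3 creates the link, which is the link that disappears from the tree once the call in logger_guard is commented out.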

 
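Experiments: killing processes
 
Next we kill processes in various ways and watch how the process tree handles the failures. The steps of the experiment are:
1. Send an exit signal with Reason some_reason to default_logger
2. Send an exit signal with Reason kill to default_logger
3. Send an exit signal with Reason some_reason to logger_guard
4. Send an exit signal with Reason some_reason to log4erl_sup
5. Send an exit signal with Reason kill to log4erl_sup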
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %% gen_event sets process_flag(trap_exit, true), so the some_reason exit signal does not kill it
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %% send the process an untrappable kill signal
true

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % first, log4erl_sup reports that a child has terminated
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.45.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % log4erl_sup restarts default_logger; the new pid is <0.69.0>
supervisor: {local,log4erl_sup}
started: [{pid,<0.69.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % the exit of default_logger reaches its linked processes with reason killed, so the corresponding logger_guard terminates
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.46.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf, [{dir,"./log"},{level,debug},{file,default_log},{type,size},
{max,1000000},{suffix,log}, {rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % logger_guard is restarted
supervisor: {local,log4erl_sup}
started: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug}, {file,default_log},{type,size},
{max,1000000}, {suffix,log},{rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). % this is the newly started logger_guard process
true
11> exit(pid(0,70,0),some_reason). % send the process an exit signal
true

=SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: some_reason
Offender: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
supervisor: {local,log4erl_sup}
started: [{pid,<0.76.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). % the propagated exit signal has no effect on default_logger
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % a supervisor also sets process_flag(trap_exit, true) when it initializes, so it survives
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). % kill log4erl_sup; the application stops
true

=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.69.0>
registered_name: default_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.76.0>,killed}]
links: [#Port<0.1891>,#Port<0.1885>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 720
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.47.0>
registered_name: mail_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.48.0>,killed}]
links: [#Port<0.546>]
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 411
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
crasher:
initial call: application_master:init/4
pid: <0.42.0>
registered_name: []
exception exit: killed
in function application_master:terminate/2
ancestors: [<0.41.0>]
messages: []
links: [<0.6.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 1555
neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
application: log4erl
exited: killed
type: temporary
18>
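
The difference between some_reason and kill seen in the session above can be reproduced with a bare process that traps exits. This is a minimal sketch; the module name trap_demo is an assumption, and no OTP behaviour is involved.

```erlang
-module(trap_demo).
-export([run/0]).

%% Returns {ReasonSeenByTheTrappingProcess, AliveAfterwards, ReasonOfDeath}.
run() ->
    Parent = self(),
    Pid = spawn(fun() ->
                        process_flag(trap_exit, true),
                        Parent ! ready,
                        loop(Parent)
                end),
    receive ready -> ok end,            % make sure trap_exit is set
    Ref = erlang:monitor(process, Pid),
    %% An ordinary exit signal is turned into a message and survived.
    exit(Pid, some_reason),
    Trapped = receive {trapped, R} -> R after 1000 -> timeout end,
    Alive = is_process_alive(Pid),
    %% kill cannot be trapped; the monitor reports the reason as killed.
    exit(Pid, kill),
    Down = receive
               {'DOWN', Ref, process, Pid, Why} -> Why
           after 1000 ->
               timeout
           end,
    {Trapped, Alive, Down}.

loop(Parent) ->
    receive
        {'EXIT', _From, Reason} ->
            Parent ! {trapped, Reason},
            loop(Parent)
    end.
```

trap_demo:run() returns {some_reason,true,killed}, matching what default_logger and log4erl_sup showed in the shell: some_reason is trapped and survived, while kill terminates the process and is reported as killed.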


 Finally, here is the address of the log4erl project once more: http://code.google.com/p/log4erl/. I recommend downloading the code and running the experiments above yourself.

 
