Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application. --Erlang/OTP Doc
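The basic idea of a supervisor is to isolate and manage failures through a hierarchy: it keeps its child processes alive by restarting them. If a supervisor is itself part of a process tree, its own supervisor terminates it automatically; when told to shut down, it terminates all of its children in the reverse of their start order and then terminates itself.

In [Erlang 0025] Understanding Erlang/OTP - Application we used the log4erl project to study Erlang/OTP applications, and noted that the application's start function starts log4erl's top-level supervision tree. Today we follow up: we look at how log4erl's supervision tree is built, and run experiments to see how a supervisor restores service by restarting processes. The process tree after starting with application:start(log4erl). looks like this: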

Below is the start_link function from log4erl_sup. supervisor:start_link is synchronous: it does not return until all child processes have been started. It calls the callback function init/1.

start_link(Default_logger) ->
    R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
    %log4erl:start_link(Default_logger),
    add_logger(Default_logger),
    ?LOG2("Result in supervisor is ~p~n",[R]),
    R.

%% the init/1 callback
init([]) ->
    {ok, {{one_for_one,3,10}, []}}.
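The initialization of log4erl's top-level supervisor is very simple. It only defines the restart strategy (RestartStrategy) and the maximum restart frequency: {one_for_one,3,10}.

{one_for_one,3,10} has the form {How, Max, Within}: How is the restart strategy, and Max is the maximum number of restarts allowed within Within seconds. The maximum restart frequency exists to avoid an endless restart loop. Once the threshold is exceeded, the supervisor terminates all of its children and then itself, and the exit propagates up the process tree so that the supervisor above can take appropriate action: either restart the terminated supervisor or terminate itself as well. Choosing these values can be awkward, and most references only say that it "depends entirely on your application". There is a rule of thumb, though: a commonly cited value for production is 4 restarts within one hour. You can also look at how open source projects similar to your application configure it. A value such as {one_for_one,0,1} allows no restarts at all; in the examples below, the Yaws project uses this 0, 1 setting (with the one_for_all strategy).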
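To make the {How, Max, Within} rule concrete, here is a minimal sketch of the bookkeeping. It is not the code in OTP's supervisor module; the module name restart_window and the function allowed/4 are hypothetical, and timestamps are plain integers in seconds.

```erlang
-module(restart_window).
-export([allowed/4]).

%% Restarts: timestamps (seconds) of earlier restarts, newest first.
%% Now:      timestamp of the restart being requested.
%% Returns {ok, Recent} when the restart keeps the supervisor within
%% Max restarts per Within seconds, or shutdown when the limit is exceeded.
%% Hypothetical helper for illustration only.
allowed(Restarts, Now, Max, Within) ->
    %% Keep only the restarts that fall inside the time window.
    Recent = [T || T <- [Now | Restarts], Now - T =< Within],
    case length(Recent) > Max of
        true  -> shutdown;
        false -> {ok, Recent}
    end.
```

With Max = 3 and Within = 10, a fourth restart inside ten seconds yields shutdown, while with Max = 0 the very first restart already does.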
Here are the init functions of the top-level supervisors in a few open source projects:

%% rabbit_sup.erl, from the well-known RabbitMQ
init([]) -> {ok, {{one_for_all, 0, 1}, []}}.

%% yaws_sup.erl, the Yaws project - Yet Another Web Server
init([]) ->
    ChildSpecs = child_specs(),
    %% 0, 1 means that we never want supervisor restarts
    {ok,{{one_for_all, 0, 1}, ChildSpecs}}.

%% ejabberd_sup, the ejabberd project
init([]) ->
    Hooks =
        {ejabberd_hooks,
         {ejabberd_hooks, start_link, []},
    %%......................... code omitted
    {ok, {{one_for_one, 10, 1},
          [Hooks,
           GlobalRouter,
           Cluster,
           ..................
           Listener]}}.
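Restart strategies
one_for_one : children are treated as independent. If one child fails, the other children are not affected, and only the child that died is restarted.
one_for_all : if one child terminates, all other children are terminated as well, and then all of them are restarted.
rest_for_one : if one child terminates, the children that were started after it are terminated. The terminated child and the children stopped along with it are then restarted.
simple_one_for_one : a simplified one_for_one in which all children are dynamically added instances of the same kind of process.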
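As an illustration of the last strategy, the following is a minimal sketch of a simple_one_for_one supervisor. It is not taken from log4erl; the module name simple_pool_sup, the worker functions, and the Tag argument are all hypothetical. The worker is a bare spawned process rather than a gen_server to keep the example self-contained.

```erlang
-module(simple_pool_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/1, worker_init/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% With simple_one_for_one, start_child/2 takes a list of extra arguments
%% that is appended to the argument list of the single child specification.
start_worker(Tag) ->
    supervisor:start_child(?MODULE, [Tag]).

init([]) ->
    %% Exactly one child specification acts as the template for all children.
    ChildSpec = {worker, {?MODULE, worker_init, []},
                 temporary, brutal_kill, worker, [?MODULE]},
    {ok, {{simple_one_for_one, 3, 10}, [ChildSpec]}}.

%% Hypothetical worker start function. It runs in the supervisor process,
%% so spawn_link/1 links the new worker to the supervisor.
worker_init(Tag) ->
    Pid = spawn_link(fun() -> worker_loop(Tag) end),
    {ok, Pid}.

worker_loop(Tag) ->
    receive
        {From, who} -> From ! {self(), Tag}, worker_loop(Tag);
        stop        -> ok
    end.
```

Each call such as simple_pool_sup:start_worker(a) adds one more instance of the same worker under the supervisor.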
After starting the top-level supervisor, start_link in log4erl_sup.erl adds a default logger with add_logger(Default_logger):

add_logger(Name) when is_atom(Name) ->
    N = atom_to_list(Name),
    add_logger(N);
add_logger(Name) when is_list(Name) ->
    C1 = {Name,
          {log_manager, start_link ,[Name]},
          permanent,
          10000,
          worker,
          [log_manager]},
    ?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
    supervisor:start_child(?MODULE, C1).

The logger that is added is a child of log4erl_sup. How a child is started and supervised is given by its child specification.
C1 = {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
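The six elements of C1 are {ID, Start, Restart, Shutdown, Type, Modules}:
ID : used internally by the supervisor to tell child specifications apart, so it only has to be unique among the child specifications.
Start : the start function, given as {M,F,A}.
Restart : whether the process is restarted after it terminates.
    permanent : always restarted, whatever caused the termination
    temporary : never restarted
    transient : restarted only if it terminates abnormally
Shutdown : how the process is terminated. An integer value (10000 in C1) means that the process has that many milliseconds to clean up and exit on its own before it is killed by force. In practice the supervisor sends the child exit(Pid,shutdown) and waits for the exit signal to come back; if none arrives within the given time, the child is killed with exit(Child,kill). Other possible values:
    brutal_kill : the process is killed immediately
    infinity : use this when the child is itself a supervisor, so that it has enough time to shut down its own subtree
Type : either supervisor or worker. Any process that does not implement the supervisor behaviour is a worker. Layering supervisors gives finer-grained control over processes. The main purpose of this value is to tell the supervising process whether its child is a supervisor or a worker.
Modules : the modules the process depends on. This information is only used during hot code upgrade, to determine which processes are affected when a module is upgraded. Usually it is enough to list the main module of the process. If the child is a supervisor, gen_server or gen_fsm, Modules is a one-element list containing the name of the callback module. If the child is a gen_event, Modules is dynamic. Yu Feng has written a dedicated analysis of the dynamic value: "Erlang supervisor规格的dynamic行为分析" (an analysis of the dynamic behaviour in Erlang supervisor specifications) http://blog.yufeng.info/archives/1455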
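For readers on newer releases: since OTP 18 a child specification may also be written as a map with the keys id, start, restart, shutdown, type and modules. The sketch below converts the six-element tuple described above into that form. The module name child_spec_map and the function to_map/1 are hypothetical helpers, not part of log4erl or OTP.

```erlang
-module(child_spec_map).
-export([to_map/1]).

%% Convert a tuple child specification into the equivalent map form.
%% The key names are the ones the supervisor module expects.
to_map({Id, Start, Restart, Shutdown, Type, Modules}) ->
    #{id       => Id,
      start    => Start,
      restart  => Restart,
      shutdown => Shutdown,
      type     => Type,
      modules  => Modules}.
```

For example, passing the C1 tuple for "default_logger" yields a map whose restart value is permanent and whose shutdown value is 10000.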
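In real use, several loggers are added to log4erl according to the business logic, and we do not start it directly with application:start(log4erl). Instead we call log4erl:conf("log4erl.conf"), which is a thin wrapper around the internal logic; what actually runs is log4erl_conf:conf(File). Here we define a simple log4erl.conf file, start with log4erl:conf("log4erl.conf"), and look at the resulting process tree:
%% log4erl.conf file; the layout has been condensed slightly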
%%mod
logger default_logger{
file_appender default_app{
dir = "./log", level = debug, file = default_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
}
}
%%mail mod
logger mail_logger{
file_appender mail_app{
dir = "./log", level = debug, file = mail_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
}
}
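The corresponding process tree looks like this; the red lines between processes indicate links:

We now follow the call chain through the code step by step: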
%==== File : log4erl_conf =======
%% log4erl_conf:conf(File).
conf(File) ->
    application:start(log4erl),   %% start log4erl
    Tree = parse(leex(File)),     %% parse the configuration file
    traverse(Tree).               %% walk the configuration entries and build the supervision tree

%% Following the traversal: element/1 is called for each configuration entry
traverse([]) ->
    ok;
traverse([H|Tree]) ->
    element(H),
    traverse(Tree).

%% Our custom loggers take the {logger, Logger, Appenders} clause
element({cutoff_level, CutoffLevel}) ->
    log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
    appenders(Appenders);
element({logger, Logger, Appenders}) ->
    log4erl:add_logger(Logger),
    appenders(Logger, Appenders).
%==== File : log4erl =======
%% Following further, we reach log4erl:add_logger/1
add_logger(Logger) ->
    try_msg({add_logger, Logger}).

%% try_msg is a generic wrapper that adds exception handling
try_msg(Msg) ->
    try
        handle_call(Msg)
    catch
        exit:{noproc, _M} ->
            io:format("log4erl has not been initialized yet. To do so, please run~n"),
            io:format("> application:start(log4erl).~n"),
            {error, log4erl_not_started};
        E:M ->
            ?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
            {E, M}
    end.

%% Fragment of handle_call
handle_call({add_logger, Logger}) ->
    log_manager:add_logger(Logger);

%==== File : log_manager =======
%% Control passes to add_logger(Logger) in log_manager,
%% which ends up calling log4erl_sup:add_logger(Logger), already analysed above
add_logger(Logger) ->
    log4erl_sup:add_logger(Logger).
%% After adding the logger, element/1 adds the appenders
appenders([]) ->
    ok;
appenders([H|Apps]) ->
    appender(H),
    appenders(Apps).

appenders(_, []) ->
    ok;
appenders(Logger, [H|Apps]) ->
    appender(Logger, H),
    appenders(Logger, Apps).

appender({appender, App, Name, Conf}) ->
    log4erl:add_appender({App, Name}, {conf, Conf}).

appender(Logger, {appender, App, Name, Conf}) ->
    log4erl:add_appender(Logger, {App, Name}, {conf, Conf}).
%==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
    try_msg({add_appender, Logger, Appender, Conf}).

handle_call({add_appender, Logger, Appender, Conf}) ->
    log_manager:add_appender(Logger, Appender, Conf);

%==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
    ?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
    log4erl_sup:add_guard(Logger, Appender, Name, Conf).

%==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
    C = {Name,
         {logger_guard, start_link ,[Logger, Appender, Name, Conf]},
         permanent,
         10000,
         worker,
         [logger_guard]},
    ?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
    supervisor:start_child(?MODULE, C).
%==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
    %?LOG2("starting guard for logger ~p~n",[Logger]),
    {ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
    case add_sup_handler(Pid, Logger, Conf) of
        {error, E} ->
            gen_server:call(Pid, stop),
            {error, E};
        _R ->
            {ok, Pid}
    end.

add_sup_handler(G_pid, Logger, Conf) ->
    ?LOG("add_sup()~n"),
    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).

handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
    ?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
    try
        Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
        {reply, Res, State}
    catch
        E:R ->
            {reply, {error, {E,R}}, State}
    end;
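gen_event:add_sup_handler installs the event handler and also creates a link between the event manager and the process that calls it (here the logger_guard). To see the effect, we comment that part out and look at the supervision tree again:
add_sup_handler(G_pid, Logger, Conf) ->
    % ?LOG("add_sup()~n"),
    % gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
    ok.
With this commented out, the link between the logger and the guard is gone.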
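The linking effect of gen_event:add_sup_handler can be checked in isolation. The sketch below is independent of log4erl: the module sup_handler_demo is a hypothetical, do-nothing event handler, and run/0 compares the caller's links before and after installing it.

```erlang
-module(sup_handler_demo).
-behaviour(gen_event).
-export([run/0]).
-export([init/1, handle_event/2, handle_call/2,
         handle_info/2, terminate/2, code_change/3]).

%% Minimal gen_event callbacks; the handler keeps no meaningful state.
init(_Args) -> {ok, []}.
handle_event(_Event, State) -> {ok, State}.
handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Info, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Returns {LinkedBefore, LinkedAfter} for the calling process.
run() ->
    {ok, Mgr} = gen_event:start(),              % unlinked event manager
    {links, Before} = process_info(self(), links),
    ok = gen_event:add_sup_handler(Mgr, ?MODULE, []),
    {links, After} = process_info(self(), links),
    ok = gen_event:stop(Mgr),
    {lists:member(Mgr, Before), lists:member(Mgr, After)}.
```

The event manager appears among the caller's links only after add_sup_handler, which is the relationship drawn between a logger and its guard in the process tree.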

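Experiment: killing processes
Next we kill processes in various ways and watch how the process tree handles the failures. The steps are:
1. Send an exit signal with Reason some_reason to default_logger
2. Send an exit signal with Reason kill to default_logger
3. Send an exit signal with Reason some_reason to logger_guard
4. Send an exit signal with Reason some_reason to log4erl_sup
5. Send an exit signal with Reason kill to log4erl_sup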
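The difference between the two reasons used in these steps can be reproduced with a plain process. This is a minimal sketch, not log4erl code; the module name exit_signal_demo is hypothetical. A process that traps exits turns an ordinary exit signal into a message, whereas kill cannot be trapped.

```erlang
-module(exit_signal_demo).
-export([run/0]).

%% Returns {TrappedReason, AliveAfterSomeReason, ReasonAfterKill}.
run() ->
    Parent = self(),
    Pid = spawn(fun() ->
                    process_flag(trap_exit, true),
                    Parent ! {self(), ready},
                    loop(Parent)
                end),
    %% Wait until the child is trapping exits before sending signals.
    receive {Pid, ready} -> ok end,
    exit(Pid, some_reason),
    Trapped = receive {Pid, trapped, R} -> R after 1000 -> timeout end,
    Alive = is_process_alive(Pid),
    Ref = monitor(process, Pid),
    exit(Pid, kill),
    Down = receive
               {'DOWN', Ref, process, Pid, Why} -> Why
           after 1000 -> timeout
           end,
    {Trapped, Alive, Down}.

%% The trapped exit signal arrives as an ordinary message.
loop(Parent) ->
    receive
        {'EXIT', _From, Reason} ->
            Parent ! {self(), trapped, Reason},
            loop(Parent)
    end.
```

The process survives some_reason and reports it as a message, while kill terminates it with the reason killed, which is the reason that shows up in the supervisor reports.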
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %% gen_event sets process_flag(trap_exit, true) by default, so the some_reason exit signal does not kill it
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %% send the process a kill signal, which cannot be trapped
true

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % first, log4erl_sup reports that a child has terminated
     Supervisor: {local,log4erl_sup}
     Context:    child_terminated
     Reason:     killed
     Offender:   [{pid,<0.45.0>},
                  {name,"default_logger"},
                  {mfargs,{log_manager,start_link,["default_logger"]}},
                  {restart_type,permanent},
                  {shutdown,10000},
                  {child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % log4erl_sup restarts default_logger; the new pid is <0.69.0>
          supervisor: {local,log4erl_sup}
             started: [{pid,<0.69.0>},
                       {name,"default_logger"},
                       {mfargs,{log_manager,start_link,["default_logger"]}},
                       {restart_type,permanent},
                       {shutdown,10000},
                       {child_type,worker}]

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % the exit of default_logger becomes killed and propagates to the linked processes, so the corresponding logger_guard terminates
     Supervisor: {local,log4erl_sup}
     Context:    child_terminated
     Reason:     killed
     Offender:   [{pid,<0.46.0>},
                  {name,default_app},
                  {mfargs,
                      {logger_guard,start_link,
                          [default_logger,file_appender,default_app,
                           {conf,
                               [{dir,"./log"},{level,debug},{file,default_log},{type,size},
                                {max,1000000},{suffix,log},
                                {rotation,50},
                                {format," %d %h:%m:%s.%i %l%n"}]}]}},
                  {restart_type,permanent},
                  {shutdown,10000},
                  {child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % logger_guard is restarted
          supervisor: {local,log4erl_sup}
             started: [{pid,<0.70.0>},
                       {name,default_app},
                       {mfargs,
                           {logger_guard,start_link,
                               [default_logger,file_appender,default_app,
                                {conf,
                                    [{dir,"./log"},{level,debug},
                                     {file,default_log},{type,size},
                                     {max,1000000},
                                     {suffix,log},{rotation,50},
                                     {format," %d %h:%m:%s.%i %l%n"}]}]}},
                       {restart_type,permanent},
                       {shutdown,10000},
                       {child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). % this is the newly started logger_guard process
true
11> exit(pid(0,70,0),some_reason). % send the process an exit signal
true
=SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
     Supervisor: {local,log4erl_sup}
     Context:    child_terminated
     Reason:     some_reason
     Offender:   [{pid,<0.70.0>},
                  {name,default_app},
                  {mfargs,
                      {logger_guard,start_link,
                          [default_logger,file_appender,default_app,
                           {conf,
                               [{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
                                {suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
                  {restart_type,permanent},
                  {shutdown,10000},
                  {child_type,worker}]

12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
          supervisor: {local,log4erl_sup}
             started: [{pid,<0.76.0>},
                       {name,default_app},
                       {mfargs,
                           {logger_guard,start_link,
                               [default_logger,file_appender,default_app,
                                {conf,
                                    [{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
                                     {suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
                       {restart_type,permanent},
                       {shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). % the propagated exit signal had no effect on default_logger
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % a supervisor also sets process_flag(trap_exit, true) when it initializes
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). % kill log4erl_sup; the application stops
true
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
  crasher:
    initial call: gen_event:init_it/6
    pid: <0.69.0>
    registered_name: default_logger
    exception exit: killed
      in function  gen_event:terminate_server/4
    ancestors: [log4erl_sup,<0.43.0>]
    messages: [{'EXIT',<0.76.0>,killed}]
    links: [#Port<0.1891>,#Port<0.1885>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 24
    reductions: 720
  neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
  crasher:
    initial call: gen_event:init_it/6
    pid: <0.47.0>
    registered_name: mail_logger
    exception exit: killed
      in function  gen_event:terminate_server/4
    ancestors: [log4erl_sup,<0.43.0>]
    messages: [{'EXIT',<0.48.0>,killed}]
    links: [#Port<0.546>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 411
  neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.42.0>
    registered_name: []
    exception exit: killed
      in function  application_master:terminate/2
    ancestors: [<0.41.0>]
    messages: []
    links: [<0.6.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 24
    reductions: 1555
  neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
    application: log4erl
    exited: killed
    type: temporary
18>
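The restart behaviour seen in the transcript can also be reproduced without log4erl. The following is a minimal sketch under stated assumptions: the module restart_demo, the child id demo_worker and the polling helper wait_for_restart/3 are hypothetical, and the worker is a bare spawned process. As in log4erl_sup, the supervisor starts with an empty child list and the child is added dynamically.

```erlang
-module(restart_demo).
-behaviour(supervisor).
-export([run/0, start_worker/0]).
-export([init/1]).

%% Same shape as log4erl_sup: one_for_one, no static children.
init([]) ->
    {ok, {{one_for_one, 3, 10}, []}}.

%% Hypothetical worker; runs in the supervisor, so it is linked to it.
start_worker() ->
    Pid = spawn_link(fun() -> receive stop -> ok end end),
    {ok, Pid}.

%% Start a supervisor, add a permanent child, kill it and
%% return {OldPid, NewPid} once the supervisor has restarted it.
run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Spec = {demo_worker, {?MODULE, start_worker, []},
            permanent, 2000, worker, [?MODULE]},
    {ok, Old} = supervisor:start_child(Sup, Spec),
    exit(Old, kill),
    New = wait_for_restart(Sup, Old, 50),
    {Old, New}.

%% Poll which_children/1 until the child has a new pid (or give up).
wait_for_restart(_Sup, _Old, 0) ->
    timeout;
wait_for_restart(Sup, Old, N) ->
    case supervisor:which_children(Sup) of
        [{demo_worker, Pid, worker, _}] when is_pid(Pid), Pid =/= Old ->
            Pid;
        _ ->
            timer:sleep(20),
            wait_for_restart(Sup, Old, N - 1)
    end.
```

Calling restart_demo:run() in the shell should also print a supervisor report with the context child_terminated, similar in shape to the reports above.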
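Finally, here is the address of the log4erl project once more: http://code.google.com/p/log4erl/ . I recommend downloading the code and running the experiments above yourself.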