Traffic Server Process Model

 

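1. Overview

Traffic Server consists of three cooperating processes that serve requests and manage, control, and monitor the health of the system. Figure 1 shows how the three processes relate; each is described below.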

 

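Figure 1: Relationship between the processes

1) The traffic_server process is Traffic Server's transaction processing engine. It accepts connections, processes protocol requests, and serves content from the local cache or the origin server.

2) The traffic_manager process is the command and control facility of Traffic Server. It launches, monitors, and reconfigures the traffic_server process. It is also responsible for the proxy autoconfiguration port, the statistics interface, cluster administration, and virtual IP failover.

If traffic_manager detects that traffic_server has failed, it restarts the process immediately and also maintains a connection queue for all incoming requests. Connections that arrive in the few seconds before traffic_server has restarted are saved in the queue and processed in first-in, first-out (FIFO) order. The queue accepts any connection that arrives while the server is restarting after a failure.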
 

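3) The traffic_cop process monitors the health of both traffic_server and traffic_manager. It periodically (several times each minute) queries both processes by issuing heartbeat requests that fetch synthetic web pages. If a failure occurs (no response arrives within the timeout interval, or an incorrect response is received), traffic_cop restarts traffic_server and traffic_manager. The benefit of this design is that traffic_server is protected twice, by traffic_manager and by traffic_cop: it is the worker process, so it must be kept running.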

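4) Traffic Server uses a multithreaded, asynchronous event-processing model. Rather than creating a thread per connection, it creates a configurable number of worker threads in advance, each running its own asynchronous event handler. traffic_server creates several groups of Threads and dispatches each Event, by type, to the event queue of the appropriate Thread. The Thread executes the callback of the Continuation associated with the Event, which advances a state machine. The transition from the initial state to the final state represents the complete handling of an event. Threads never exit; they wait for the next event to arrive.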
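The scheduling idea in item 4 can be sketched in a few lines. This is a single-threaded illustration only, not Traffic Server's real classes: the names Continuation, Event, EventQueue, run_once and step are hypothetical stand-ins chosen for this sketch, and a real worker thread would block waiting for new events instead of returning.

```cpp
#include <deque>
#include <functional>

// Hypothetical stand-ins for illustration; not the real Traffic Server types.
enum class State { Initial, Reading, Done };

struct Continuation {
  State state = State::Initial;
  // The callback receives an event code and moves the state machine forward.
  std::function<void(Continuation &, int)> handler;
};

struct Event {
  Continuation *cont;
  int code;
};

class EventQueue {
public:
  // Queue an event for the continuation that should handle it.
  void schedule(Continuation *c, int code) { events_.push_back(Event{c, code}); }

  // One pass of a worker loop: drain the queue and invoke each callback.
  // Returns the number of events handled.
  int run_once() {
    int handled = 0;
    while (!events_.empty()) {
      Event ev = events_.front();
      events_.pop_front();
      ev.cont->handler(*ev.cont, ev.code);
      ++handled;
    }
    return handled;
  }

private:
  std::deque<Event> events_;
};

// Example state machine: Initial -> Reading -> Done, one step per event.
inline void step(Continuation &c, int /*code*/) {
  if (c.state == State::Initial) {
    c.state = State::Reading;
  } else if (c.state == State::Reading) {
    c.state = State::Done;
  }
}
```

Scheduling two events for one continuation and calling run_once() walks it from the initial state to the final state, which is the "state migration" the paragraph describes.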

 

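This article focuses on the relationship among the three Traffic Server processes and on how it is implemented; the multithreaded asynchronous event model is not analyzed in depth. The process model is shown in the diagram below:

2. Implementation principles

Basic principle: traffic_manager and traffic_server are each assigned a lock file, manager_lockfile and server_lockfile respectively. traffic_cop monitors both processes through the two lock files, and in the same way traffic_manager monitors traffic_server through server_lockfile. Figure 2 illustrates this relationship:

Figure 2: Relationship between the processes and their lock files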
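As a rough illustration of the lock-file principle, the sketch below tries to take an exclusive fcntl() write lock on a file. If the lock is granted, no other process holds it, so the monitored process is not running; if it is refused, some other process is alive and holding it. The helper name try_lock_file is an assumption made for this sketch and does not appear in Traffic Server.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper for illustration.
// Returns  1 if this process obtained the write lock (nobody else holds it);
//            *fd_out then receives the open descriptor, and closing it
//            releases the lock.
// Returns  0 if another process currently holds the lock.
// Returns -1 on any other error.
int try_lock_file(const char *path, int *fd_out) {
  int fd = open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) {
    return -1;
  }

  struct flock lock = {};
  lock.l_type = F_WRLCK;    // exclusive (write) lock
  lock.l_whence = SEEK_SET;
  lock.l_start = 0;
  lock.l_len = 0;           // 0 means "the whole file"

  if (fcntl(fd, F_SETLK, &lock) < 0) {
    int saved = errno;      // close() may overwrite errno
    close(fd);
    // POSIX reports a conflicting lock as EACCES or EAGAIN.
    return (saved == EACCES || saved == EAGAIN) ? 0 : -1;
  }

  *fd_out = fd;
  return 1;
}
```

Note that fcntl() locks belong to a process, so the "held" result (0) can only be observed from a different process than the one holding the lock; that is exactly the situation of traffic_cop probing the lock files of the other two processes.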

 

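Key implementation details:

 

Key class: Lockfile

Lockfile::Open(pid_t *holding_pid) in detail:

Notes: Lockfile::Open(pid_t *holding_pid) has three kinds of return value, listed below. On success it also sets the close-on-exec flag on the lock file descriptor, so that the descriptor is closed automatically when the process (typically after fork()) calls one of the exec() family of functions, and is therefore not passed on to the new program.

(1) A return value of 1 means the lock file was opened and locked. This shows that the process associated with the lock file is not running: if it were running, it would hold the lock and the lock could not be acquired.

(2) A return value of 0 means the lock file is held by some process. The ID of the holding process is returned through holding_pid. That ID was written into the lock file by Get() when the holding process started.

(3) A negative return value (-errno) covers the failure cases: opening fname failed, reading the close-on-exec flag failed, or setting the close-on-exec flag failed. In the code, a failed read of the holder's PID from the lock file also returns -errno.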

 

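The important functions for killing processes are summarized briefly below:

// kill

// Sends the signal sig to the process with the given pid

// return: 0 -- okay, -1 -- error

1.int kill(pid_t pid, int sig);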

 


// ink_killall

// Sends the signal sig to every process whose program name is pname
// return: 0 -- okay, -1 -- error
2. int ink_killall(const char *pname, int sig);

ink_killall calls:

  3. ink_killall_get_pidv_xmalloc (pname, &pidv, &pidvcnt);
  4. ink_killall_kill_pidv (pidv, pidvcnt, sig);

 

 

// ink_killall_get_pidv_xmalloc
// Looks up the IDs of all running processes named pname, storing them in the
// pidv array and their count in pidvcnt
// return: -1 error (pidv: set to NULL; pidvcnt: set to 0); 0 okay (pidv: ats_malloc'd
// pid vector; pidvcnt: number of pid's; if pidvcnt is set to 0, then pidv will
// be set to NULL)

3.int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);
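In the listing at the end of this article, the lookup works by reading /proc/<pid>/stat for every numeric entry under /proc and comparing the command name found between parentheses with pname. The name extraction alone can be sketched as below; parse_comm is a hypothetical helper written for this illustration, using std::string instead of the in-place strchr() edits of the original.

```cpp
#include <string>

// Hypothetical helper for illustration.
// Extracts the command name between the first '(' and the next ')' of a
// /proc/<pid>/stat line. Returns an empty string if the line is malformed.
std::string parse_comm(const std::string &stat_line) {
  std::string::size_type lparen = stat_line.find('(');
  if (lparen == std::string::npos) {
    return "";
  }
  std::string::size_type rparen = stat_line.find(')', lparen + 1);
  if (rparen == std::string::npos) {
    return "";
  }
  return stat_line.substr(lparen + 1, rparen - lparen - 1);
}
```

A process is added to pidv when the extracted name equals pname and its PID is not the caller's own.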

 


 
// ink_killall_kill_pidv
// Calls kill(pidv[i], sig) for each process ID recorded in pidv
// return: 0 -- okay, -1 -- error
4.int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt,int sig);
ink_killall_kill_pidv calls:

  1.kill(pid_t pid, int sig);


 

// safe_kill
// Safely kills all processes whose program name is pname. lockfile_name is the lock file
// associated with the process; group says whether the child processes created by pname
// should be killed too, since they share its process group.

// return: void

5. static void safe_kill(const char *lockfile_name, const char *pname, bool group);
safe_kill calls:

  6. Lockfile::Kill(killsig, coresig, pname);

  7. Lockfile::KillGroup(killsig, coresig, pname);

 

 

 

// Lockfile::Kill

// Checks the corresponding lock file and kills all processes named pname. sig is normally
// the kill signal; initial_sig defaults to 0 and, when positive, is sent to init_pid first.

// return: void

6. void Lockfile::Kill(int sig, int initial_sig, const char *pname);
Lockfile::Kill calls:

  8.lockfile_kill_internal(pid, initial_sig, pid, pname, sig);



// Lockfile::KillGroup

// Checks the corresponding lock file and kills the process named pname together with the
// child processes it created (and therefore their threads as well). sig is the kill signal.

// initial_sig has the same meaning as in Kill

// return: void

7.void Lockfile::KillGroup(int sig, int initial_sig, const char *pname);
Lockfile::KillGroup calls:

  8.lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
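What distinguishes KillGroup from Kill is the pid value handed to kill(2): a negative pid addresses a whole process group. The selection rule used in the Lockfile::KillGroup listing can be isolated as a small pure function; group_kill_target is a hypothetical name, and the getpgid() and getpid() results are passed in as parameters so the rule can be read on its own.

```cpp
#include <sys/types.h>

// Hypothetical helper for illustration.
// holder: PID read from the lock file.
// pgid:   result of getpgid(holder), negative if that call failed.
// self:   the caller's own PID.
// Returns the value to pass as the pid argument of kill(2).
pid_t group_kill_target(pid_t holder, pid_t pgid, pid_t self) {
  if (pgid < 0 || pgid == self) {
    return holder;   // fall back to signalling the single process
  }
  return -pgid;      // a negative pid makes kill(2) signal the whole group
}
```

The fallback when pgid equals the caller's own PID keeps the caller from signalling a group that it leads itself.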

 

// lockfile_kill_internal

// First sends init_sig to the init_pid process (only when init_sig > 0), then kills pid
// and all processes whose program name is pname

// return: void

8.static void lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig);
lockfile_kill_internal calls:

  1.kill(init_pid, init_sig);

  3.ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
  4.ink_killall_kill_pidv(pidv, pidvcnt, sig);

For the full implementation details, see the source code.

 

 

 

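2. Simulating how traffic_cop monitors traffic_manager and traffic_server

After traffic_cop starts, it enters main(), which calls check(). Inside check(), check_programs() is called periodically to monitor traffic_manager and traffic_server. check_programs() is somewhat involved; its flow is shown in the flowchart below.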
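The decision logic of check_programs(), as it appears in the simplified traffic_cop.cpp listing at the end of this article, can be condensed into one pure function. This is a reading aid under stated assumptions, not the actual code: Action and decide are hypothetical names, and the stale-lock handling (probing the holder with kill(pid, 0) and retrying Open) is left out.

```cpp
// Hypothetical names for illustration; see the assumptions above.
enum class Action {
  RestartManager,    // kill any stray traffic_server, then spawn traffic_manager
  Nothing,           // manager heartbeat failed, or the server is meant to be down
  WaitForServer,     // server lock found free once: let the manager spawn the server
  KillManagerGroup,  // server lock found free again: kill the manager's process group
  HeartbeatServer    // both locks are held: test the server itself
};

// manager_open / server_open: results of Lockfile::Open()
//   (1 = lock was free, 0 = lock is held, negative = error).
// not_found: counter of checks in which the server lock was found free.
Action decide(int manager_open, bool manager_heartbeat_ok, bool server_up,
              int server_open, int &not_found) {
  if (manager_open > 0) {
    return Action::RestartManager;
  }
  if (!manager_heartbeat_ok || !server_up) {
    return Action::Nothing;
  }
  if (server_open > 0) {
    if (++not_found > 1) {
      not_found = 0;
      return Action::KillManagerGroup;
    }
    return Action::WaitForServer;
  }
  return Action::HeartbeatServer;
}
```

The counter gives traffic_manager one check interval to start traffic_server by itself before traffic_cop steps in and kills the manager's group.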

 

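3. Simulation test

Following these principles, I wrote simplified imitations of the three processes traffic_cop, traffic_manager, and traffic_server, with traffic_cop implemented as a daemon. traffic_manager monitors traffic_server in much the same way that traffic_cop monitors traffic_manager and traffic_server, so that part is not described again. The functions that test the health of traffic_manager and traffic_server, namely heartbeat_manager(), server_up(), and heartbeat_server(), involve port communication. Because that does not affect the simulation of the principle, their bodies are omitted and they simply return the healthy value. (The program needs the manager_lockfile and server_lockfile files at run time; readers should create these two files in the directory that contains the executables.)

After the program starts, running the command ps -axj | grep binary produces the output shown below:

The first four columns are: parent process ID (PPID), process ID (PID), process group ID (PGID), and session ID (SID).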
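For reference, a process can query the same four identifiers for itself with standard POSIX calls. The sketch below collects them in the column order given above; ProcIds, current_ids and print_ids are hypothetical names used only in this illustration.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical types and helpers for illustration.
struct ProcIds {
  pid_t ppid;  // parent process ID
  pid_t pid;   // process ID
  pid_t pgid;  // process group ID
  pid_t sid;   // session ID
};

// Collects the four identifiers of the calling process.
ProcIds current_ids() {
  ProcIds ids;
  ids.ppid = getppid();
  ids.pid = getpid();
  ids.pgid = getpgrp();
  ids.sid = getsid(0);  // 0 means "the calling process"
  return ids;
}

// Prints the identifiers in the order PPID PID PGID SID.
void print_ids(const ProcIds &ids) {
  std::printf("%d %d %d %d\n", (int)ids.ppid, (int)ids.pid,
              (int)ids.pgid, (int)ids.sid);
}
```

For a daemon created with the fork, setsid, fork sequence used in init_daemon(), the PID is expected to differ from the SID, because the second child is no longer the session leader.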

 

The figure shows the normal relationship among the three processes.

When traffic_manager exits abnormally, traffic_cop restarts it. The log file records this (excerpt):

==============traffic_server is running, pid:'5443'!

----------------traffic_manager is running, pid:'5436'!

==============traffic_server is running, pid:'5443'!

---------------traffic_manager has a expcetion and eixt!

Entering check_programs()

traffic_manager not running, making sure traffic_server is dead

Entering safe_kill

Leaving safe_kill

Entering spwan_manager()!

Leaving spwan_manager()!

Leaving check_programs

----------------traffic_manager is running, pid:'5463'!

Entering spwan_server()!

Leaving spwan_server()!

==============traffic_server is running, pid:'5467'!

The log shows that at one moment the traffic_manager PID is 5436 and the traffic_server PID is 5443. Next, traffic_manager fails ("---------------traffic_manager has a expcetion and eixt!"). During its periodic check_programs() call, traffic_cop finds "traffic_manager not running", kills the traffic_server process ("making sure traffic_server is dead"), and then creates a new traffic_manager process ("Entering spwan_manager()!"), whose PID is now 5463. Once the new traffic_manager is running, it finds that traffic_server is not running and calls spwan_server() to create a new traffic_server process, whose PID becomes 5467. This shows that traffic_cop's monitoring works.

 

When traffic_server exits abnormally, traffic_manager detects it and restarts traffic_server. The log file records this as well (excerpt):

==============traffic_server is running, pid:'7703'!

----------------traffic_manager is running, pid:'7699'!

=================traffic_server has a expcetion and exit!

Entering safe_kill

Leaving safe_kill

--------------Entering spwan_server()!

--------------Leaving spwan_server()!

 

----------------traffic_manager is running, pid:'7699'!

==============traffic_server is running, pid:'7712'!

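The log shows that at one moment the traffic_manager PID is 7699 and the traffic_server PID is 7703. traffic_server then exits abnormally, and traffic_manager calls spwan_server() to start a new traffic_server process with PID 7712. The traffic_manager PID is still 7699, so the traffic_manager process itself has not changed. This shows that traffic_manager does monitor traffic_server.

4. Summary

Why does the design use three processes instead of two, with traffic_manager directly supervising traffic_server? The role that traffic_manager plays in the system means that two processes alone cannot meet the requirements. In particular, when traffic_manager detects that traffic_server has failed, it temporarily queues incoming requests, so it must itself listen on the ports for a while. The system therefore cannot guarantee that this process never fails; traffic_manager can terminate abnormally too. For this reason the design adds the traffic_cop daemon, whose sole role is to monitor the other two processes. In theory this daemon never terminates abnormally, so the three-tier design is safer and more reliable than a two-tier one. When the three processes work together, server failures are transparent to clients: a client does not notice that anything went wrong with its request, although latency may appear somewhat higher. (This holds by design but is not absolute: in the unlikely event that traffic_manager and traffic_server terminate abnormally at the same time, client requests cannot be accepted during the few seconds traffic_cop needs to restart them.) The architecture shows that a server should be as stable and safe as possible, and that abnormal conditions must be considered thoroughly.

Source code: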

1.lock_and_kill.h

#ifndef LOCK_AND_KILL_H
#define LOCK_AND_KILL_H
#include <sys/types.h>
#include <string.h>
#define PATH_NAME_MAX 4096

/*-------------------------------------------------------------------------
   ink_killall
   - Sends signal 'sig' to all processes with the name 'pname'
   - Returns: -1 error
               0 okay
  -------------------------------------------------------------------------*/
int ink_killall(const char *pname, int sig);

/*-------------------------------------------------------------------------
   ink_killall_get_pidv_xmalloc
   - Get all pid's named 'pname' and stores into ats_malloc'd
     pid_t array, 'pidv'
   - Returns: -1 error (pidv: set to NULL; pidvcnt: set to 0)
               0 okay (pidv: ats_malloc'd pid vector; pidvcnt: number of pid's;
                   if pidvcnt is set to 0, then pidv will be set to NULL)
  -------------------------------------------------------------------------*/
int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);

/*-------------------------------------------------------------------------
   ink_killall_kill_pidv
   - Kills all pid's in 'pidv' with signal 'sig'
   - Returns: -1 error
               0 okay
  -------------------------------------------------------------------------*/
int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig);


class Lockfile
{
public:

  Lockfile(void):fd(0)
  {
    fname[0] = '\0';
  }

  // coverity[uninit_member]
  Lockfile(const char *filename):fd(0)
  {
    strcpy(fname, filename);
  }

  ~Lockfile(void)
  {
  }

  void SetLockfileName(const char *filename)
  {
    strcpy(fname, filename);
  }

  const char *GetLockfileName(void)
  {
    return fname;
  }

  // Open() ----- a very important function
  //
  // Tries to open a lock file, returning:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Open(pid_t * holding_pid);

  // Get()
  //
  // Gets write access to a lock file, and if successful, truncates
  // file, and writes the current process ID.  Returns:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Get(pid_t * holding_pid);

  // Close()
  //
  // Closes the file handle on the opened Lockfile.
  void Close(void);

  // Kill()
  // KillGroup()
  //
  // Ensures no one is holding the lock. It tries to open the lock file
  // and if that does not succeed, it kills the process holding the lock.
  // If the lock file open succeeds, it closes the lock file releasing
  // the lock.
  //
  // The initial signal can be used to generate a core from the process while
  // still ensuring it dies.
  void Kill(int sig, int initial_sig = 0, const char *pname = NULL);
  void KillGroup(int sig, int initial_sig = 0, const char *pname = NULL);

private:
  char fname[PATH_NAME_MAX];
  int fd;
};


#endif

2.lock_and_kill.cpp

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <errno.h>
#include <signal.h>

#include "lock_and_kill.h"


#define PROC_BASE "/proc"
#define INITIAL_PIDVSIZE 32
#define LOCKFILE_BUF_LEN 16
#ifndef LINE_MAX
#define LINE_MAX 1024 // fallback only; <limits.h> may already define LINE_MAX
#endif

int
ink_killall(const char *pname, int sig)
{
  int err;
  pid_t *pidv;
  int pidvcnt;

  if (ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt) < 0) {
    return -1;
  }

  if (pidvcnt == 0) {
    free(pidv);
    return 0;
  }

  err = ink_killall_kill_pidv(pidv, pidvcnt, sig);
  free(pidv);
  return err;
}

int
ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt)
{
  DIR *dir;
  FILE *fp;
  struct dirent *de;
  pid_t pid, self;
  char buf[LINE_MAX], *p, *comm;
  int pidvsize = INITIAL_PIDVSIZE;

  if (!pname || !pidv || !pidvcnt)
    goto l_error;

  self = getpid();
  if (!(dir = opendir(PROC_BASE)))
    goto l_error;

  *pidvcnt = 0;
  *pidv = (pid_t *)malloc(pidvsize * sizeof(pid_t));
  if (!*pidv) {
    closedir(dir);
    goto l_error;
  }

  // Every numeric entry under /proc is a process; skip ourselves.
  while ((de = readdir(dir))) {
    if (!(pid = (pid_t) atoi(de->d_name)) || pid == self)
      continue;
    snprintf(buf, sizeof(buf), PROC_BASE "/%d/stat", pid);
    if ((fp = fopen(buf, "r"))) {
      if (fgets(buf, sizeof buf, fp) == 0)
        goto l_close;
      // The command name is the text between '(' and ')'.
      if ((p = strchr(buf, '('))) {
        comm = p + 1;
        if ((p = strchr(comm, ')')))
          *p = '\0';
        else
          goto l_close;
        if (strcmp(comm, pname) == 0) {
          if (*pidvcnt >= pidvsize) {
            pid_t *pidv_realloc;
            pidvsize *= 2;
            if (!(pidv_realloc = (pid_t *)realloc(*pidv, pidvsize * sizeof(pid_t)))) {
              free(*pidv);
              fclose(fp);
              closedir(dir);
              goto l_error;
            } else {
              *pidv = pidv_realloc;
            }
          }
          (*pidv)[*pidvcnt] = pid;
          (*pidvcnt)++;
        }
      }
    l_close:
      fclose(fp);
    }
  }
  closedir(dir);

  if (*pidvcnt == 0) {
    free(*pidv);
    *pidv = 0;
  }
  return 0;
l_error:
  // The output pointers themselves may be NULL on this path.
  if (pidv)
    *pidv = NULL;
  if (pidvcnt)
    *pidvcnt = 0;
  return -1;
}

int
ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig)
{
  int err = 0;
  if (!pidv || (pidvcnt <= 0))
    return -1;
  while (pidvcnt > 0) {
    pidvcnt--;
    if (kill(pidv[pidvcnt], sig) < 0)
      err = -1;
  }
  return err;
}


//////////////////// Lockfile member functions follow ///////////////////
////////////////////////////////////////////////////////////////////////
int
Lockfile::Open(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  pid_t val;
  int err;
  *holding_pid = 0;

#define FAIL(x) \
{ \
  if (fd > 0) \
    close (fd); \
  return (x); \
}

  struct flock lock;
  char *t;
  int size; // assigned in the read loop below

  // Try and open the Lockfile. Create it if it does not already
  // exist.
  do {
    fd = open(fname, O_RDWR | O_CREAT, 0644);
  } while ((fd < 0) && (errno == EINTR));

  if (fd < 0)
    return (-errno);

  // Lock it. Note that if we can't get the lock EAGAIN will be the
  // error we receive.
  lock.l_type = F_WRLCK;
  lock.l_start = 0;
  lock.l_whence = SEEK_SET;
  lock.l_len = 0;

  do {
    err = fcntl(fd, F_SETLK, &lock);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    // We couldn't get the lock. Try and read the process id of the
    // process holding the lock from the lockfile.
    t = buf;

    // Leave room for the terminating '\0'.
    for (size = LOCKFILE_BUF_LEN - 1; size > 0;) {
      do {
        err = read(fd, t, size);
      } while ((err < 0) && (errno == EINTR));

      if (err < 0)
        FAIL(-errno);
      if (err == 0)
        break;

      size -= err;
      t += err;
    }
    *t = '\0';

    // coverity[secure_coding]
    if (sscanf(buf, "%d\n", (int*)&val) != 1) {
      *holding_pid = 0;
    } else {
      *holding_pid = val;
    }
    FAIL(0);

  }
  // If we did get the lock, then set the close on exec flag so that
  // we don't accidentally pass the file descriptor to a child process
  // when we do a fork/exec.
  do {
    err = fcntl(fd, F_GETFD, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  val = err | FD_CLOEXEC;

  do {
    err = fcntl(fd, F_SETFD, val);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  // The opened lockfile descriptor is kept in the member fd. When this
  // file descriptor is closed the lock will be released.
  return (1);                   // success
#undef FAIL
}

int
Lockfile::Get(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  int err;
  *holding_pid = 0;

  fd = -1;

  // Open the Lockfile and get the lock. If we are successful, the
  // member fd will hold the file descriptor of the opened Lockfile.
  err = Open(holding_pid);
  if (err != 1)
    return err;

  if (fd < 0) {
    return -1;
  }

  // Truncate the Lockfile effectively erasing it.
  do {
    err = ftruncate(fd, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    close(fd);
    return (-errno);
  }

  // Write our process id to the Lockfile.
  snprintf(buf, sizeof(buf), "%d\n", (int) getpid());

  do {
    err = write(fd, buf, strlen(buf));
  } while ((err < 0) && (errno == EINTR));

  if (err != (int) strlen(buf)) {
    close(fd);
    return (-errno);
  }
  return (1);                   // success
}

void
Lockfile::Close(void)
{
  if (fd != -1) {
    close(fd);
  }
}

//-------------------------------------------------------------------------
// Lockfile::Kill() and Lockfile::KillGroup()
//
// Open the lockfile. If we succeed it means there was no process
// holding the lock. We'll just close the file and release the lock
// in that case. If we don't succeed in getting the lock, the
// process id of the process holding the lock is returned. We
// repeatedly send the KILL signal to that process until doing so
// fails. That is, until kill says that the process id is no longer
// valid (we killed the process), or that we don't have permission
// to send a signal to that process id (the process holding the lock
// is dead and a new process has replaced it).
//
// INKqa11325 (Kevlar: linux machine hosed up if specific threads
// killed): Unfortunately, it's possible on Linux that the main PID of
// the process has been successfully killed (and is waiting to be
// reaped while in a defunct state), while some of the other threads
// of the process just don't want to go away.  Integrate ink_killall
// into Kill() and KillGroup() just to make sure we really kill
// everything and so that we don't spin hard while trying to kill a
// defunct process.
//-------------------------------------------------------------------------


static void
lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig)
{
  int err;

#if defined(linux)

  // Initialized so that free(pidv) below is safe when pname is NULL.
  pid_t *pidv = NULL;
  int pidvcnt = 0;

  // Need to grab pname's pid vector before we issue any kill signals.
  // Specifically, this prevents the race-condition in which
  // traffic_manager spawns a new traffic_server while we still think
  // we're killall'ing the old traffic_server.
  if (pname) {
    // Collect the PIDs of all processes named pname (nothing is killed
    // yet): pidv receives the PID array and pidvcnt the number of PIDs.
    ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
  }

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    if ((err = kill(pid, sig)) == 0) {
      sleep(1);
    }
    if (pname && (pidvcnt > 0)) {
      ink_killall_kill_pidv(pidv, pidvcnt, sig);
      sleep(1);
    }
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

  free(pidv);

#else

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    err = kill(pid, sig);
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

#endif  // linux check

}

/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////
void
Lockfile::Kill(int sig, int initial_sig, const char *pname)
{
  int err;
  int pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1)                 // success getting the lock file: no process is holding it
  {
    Close();                    // nothing to kill, just close the file
  } else if (err == 0)          // someone else has the lock
  {
    pid = holding_pid;          // PID of the process holding the lock
    if (pid != 0) {             // only kill when the PID is valid
      lockfile_kill_internal(pid, initial_sig, pid, pname, sig);
    }
  }
}


/////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////
// (Author's note: I have not fully understood this function.)
void
Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
{
  int err;
  pid_t pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1)                 // success getting the lock file
  {
    Close();
  } else if (err == 0)          // someone else has the lock
  {
    do {
      pid = getpgid(holding_pid); // get the process group ID
    } while ((pid < 0) && (errno == EINTR));

    if ((pid < 0) || (pid == getpid()))
      pid = holding_pid;
    else
      pid = -pid;               // negative pid: signal the whole group

    if (pid != 0) {
      // We kill the holding_pid instead of the process_group
      // initially since there is no point trying to get core files
      // from a group since the core file of one overwrites the core
      // file of another one
      lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
    }
  }
}

3.log.h

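#ifndef LOG_H
#define LOG_H
#include <stdio.h>

// Appends the message to log.txt in the current working directory.
// Declared inline so that the header can be included from several
// source files; takes const char* so string literals can be passed.
inline void write_to_log(const char *c)
{
    FILE *fp = fopen("log.txt", "ab");
    if (fp) {
        fputs(c, fp);
        fclose(fp);
    }
}

#endif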

4.traffic_cop.cpp

#include "lock_and_kill.h"
#include "log.h"
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/resource.h>   // getrlimit(), struct rlimit
#include <signal.h>
#include <sys/param.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <string.h>
#include <stdio.h>
#include <sys/stat.h>


#define    NOWARN_UNUSED(x)    (void)(x)

static char cop_lockfile[PATH_NAME_MAX];
static char manager_lockfile[PATH_NAME_MAX];
static char server_lockfile[PATH_NAME_MAX];

static char manager_binary[PATH_NAME_MAX] = "traffic_manager";
static char server_binary[PATH_NAME_MAX] = "traffic_server";
static int killsig = SIGKILL;
static int coresig = 0;
static int server_not_found = 0;
static int server_failures = 0;
static int manager_failures = 0;

static const int sleep_time = 10;       // 10 sec
static const int manager_timeout = 3 * 60;      //  3 min
static const int server_timeout = 3 * 60;       //  3 min
static const int kill_timeout = 1 * 60; //  1 min


static void sig_alarm_warn(int signum = 0)
{
    NOWARN_UNUSED(signum);
    alarm(kill_timeout);
}


static void sig_fatal(int signum)
{
    NOWARN_UNUSED(signum);
    abort();
}


static void set_alarm_warn()
{
    struct sigaction action;
    action.sa_handler = sig_alarm_warn;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void set_alarm_death()
{
    struct sigaction action;
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void sig_child(int signum)
{
  NOWARN_UNUSED(signum);
  pid_t pid = 0;
  int status = 0;
  for (;;) {
    pid = waitpid(WAIT_ANY, &status, WNOHANG);

    if (pid <= 0) {
      break;
    }
    // TSqa03086 - We can not log the child status signal from
    //   the signal handler since syslog can deadlock.  Record
    //   the pid and the status in a global for logging
    //   next time through the event loop.  We will occasionally
    //   lose some information if we get two sig childs in rapid
    //   succession
    // child_pid = pid;
    // child_status = status;
  }
}


static void init_signals()
{
    struct sigaction action;
    write_to_log("Entering init_signals()\n");
    action.sa_handler = sig_child;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGCHLD, &action, NULL);
    // Note: the sig_fatal action prepared below is never installed
    // in this simplified version.
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    write_to_log("leaving init_signals()\n\n");
}


static void safe_kill(const char* lockfile_name, const char * pname, bool group)
{
    Lockfile lockfile(lockfile_name);
    write_to_log("Entering safe_kill\n");
    set_alarm_warn();
    alarm(kill_timeout);

    if (group == true) {
        lockfile.KillGroup(killsig, coresig, pname);
    } else {
        lockfile.Kill(killsig, coresig, pname);
    }
    alarm(0);
    set_alarm_death();
    write_to_log("Leaving safe_kill\n\n");
}


// For simplicity these three stubs always report success (return 1).
static int server_up()
{
    return 1;
}


static int heartbeat_manager()
{
    //safe_kill(manager_lockfile, manager_binary, true);
    return 1;
}

static int heartbeat_server()
{
    //safe_kill(server_lockfile, server_binary, false);
    //server_failures = 0;
    return 1;
}


static void spawn_manager()
{
    int err;
    // execv() needs a NULL-terminated argument vector.
    char *const argv[] = { manager_binary, NULL };

    err = fork();
    write_to_log("Entering spwan_manager()!\n\n");
    if (err == 0) {
        // Child: replace the process image with the manager binary.
        err = execv(manager_binary, argv);
        write_to_log("somehow execv failed!\n");
        exit(1);
    } else if (err == -1) {
        write_to_log("unable to fork !\n");
        exit(1);
    }

    manager_failures = 0;
    write_to_log("Leaving spwan_manager()!\n\n");
}


static void init_lockfiles()
{
    // Layout::relative_to(cop_lockfile, sizeof(cop_lockfile), Layout::get()->runtimedir, COP_LOCK);
    // Layout::relative_to(manager_lockfile, sizeof(manager_lockfile), Layout::get()->runtimedir,      MANAGER_LOCK);
    // Layout::relative_to(server_lockfile, sizeof(server_lockfile), Layout::get()->runtimedir, SERVER_LOCK);

    write_to_log("Entering init_lockfiles()\n");
    strcpy(cop_lockfile, "cop_lockfile");
    strcpy(manager_lockfile, "manager_lockfile");
    strcpy(server_lockfile, "server_lockfile");

    strcpy(manager_binary, "manager_binary");
    strcpy(server_binary, "server_binary");

    write_to_log("leaving init_lockfiles()\n\n");

    //manager_lockfile="manager_lockfile";
    //server_lockfile="server_lockfile";
    //manager_binary="manager_binary";
    //server_binary="server_binary";
}


static void check_lockfile()
{
    write_to_log("Entering check_lockfile()\n");
    int err;
    pid_t holding_pid;
    Lockfile cop_lf(cop_lockfile);
    err = cop_lf.Get(&holding_pid);

    if (err < 0) {
        // Could not open or lock the cop lock file.
        write_to_log("leaving check_lockfile(),and err<0\n\n");
        exit(1);
    } else if (err == 0) {
        // Another traffic_cop already holds the lock.
        write_to_log("leaving check_lockfile(),and err==0\n\n");
        exit(1);
    }
    write_to_log("leaving check_lockfile()\n\n");
}


static void check_programs()
{
    int err;
    pid_t holding_pid;

    write_to_log("Entering check_programs()\n");
    printf("Entering check_programs()\n");
    // Try to acquire the manager lock file. If that succeeds, no
    // traffic_manager process is running.
    Lockfile manager_lf(manager_lockfile);
    err = manager_lf.Open(&holding_pid);

    // The value of err tells us the state of the manager process.
    if (err == 0) {
        write_to_log("in check_programs(),manager_lockfile,err==0\n");
        printf("in check_programs(),manager_lockfile,err==0\n");

        // Signal 0 only tests whether the holder still exists.
        if (kill(holding_pid, 0) == -1) {
            printf("holding_pid is %d,and invalid\n", (int) holding_pid);

            ink_killall(manager_binary, killsig);
            sleep(1);                 // give signals a chance to be received
            err = manager_lf.Open(&holding_pid);
        }
    }

    if (err > 0) { // the manager lock file could be acquired
        // 'lockfile_open' returns the file descriptor of the opened
        // lockfile.  We need to close this before spawning the
        // manager so that the manager can grab the lock.
        manager_lf.Close();
        // Make sure we don't have a stray traffic server running.

        write_to_log("traffic_manager not running, making sure traffic_server is dead\n");
        safe_kill(server_lockfile, server_binary, false);
        spawn_manager();
    } else {
        // err <= 0: either the manager holds the lock (0), or Open()
        // returned a negative value, for example because the lock was
        // acquired but setting the descriptor flags failed.
        //
        // If there is a manager running we want to heartbeat it to
        // make sure it hasn't wedged. If the manager test succeeds we
        // check to see if the server is up. (That is, it hasn't been
        // brought down via the UI).  If the manager thinks the server
        // is up, we make sure there is actually a server process
        // running. If there is we test it.

        alarm(2 * manager_timeout);
        err = heartbeat_manager();
        alarm(0);

        if (err < 0) { // heartbeat failed (cannot happen with the stub above)
            return;
        }

        if (server_up() <= 0) {
            // The manager is running but the server is down; we assume the
            // manager can create a new server by itself, so just return.
            return;
        }

        Lockfile server_lf(server_lockfile);
        err = server_lf.Open(&holding_pid);

        if (err == 0) {
            if (kill(holding_pid, 0) == -1) {
                ink_killall(server_binary, killsig);
                sleep(1); // give signals a chance to be received
                err = server_lf.Open(&holding_pid);
            }
        }

        if (err > 0) {
            // The server lock is free, so no server is running.
            server_lf.Close();
            server_not_found += 1;

            if (server_not_found > 1) {
                server_not_found = 0;
                safe_kill(manager_lockfile, manager_binary, true);
            }
        } else {
            alarm(2 * server_timeout);
            heartbeat_server();
            alarm(0);
        }
    }
    printf("Leaving check_programs\n\n");
    write_to_log("Leaving check_programs\n\n");
}


317 static void init()
318 {    
319     write_to_log("Entering init()\n");
320     init_signals();
321         init_lockfiles();
322         check_lockfile();
323     write_to_log("Leaving init()\n\n");
324 }
325 
// Sleep for ms milliseconds.
static void millisleep(int ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;                                  // whole seconds
    ts.tv_nsec = (ms - ts.tv_sec * 1000) * 1000 * 1000;     // remaining ms as nanoseconds
    nanosleep(&ts, NULL);
}

// Changed function from taking no argument and returning void
// to taking a void* and returning a void*. The change was made
// so that we can call ink_thread_create() on this function
// in the case of running cop as a win32 service.

static void* check(void* arg)
{
    write_to_log("Entering check()\n\n");
    for (;;) {
        // Arm an alarm long enough to cover one full pass:
        // the sleep plus the manager and server timeouts.
        alarm(2 * (sleep_time + manager_timeout * 2 + server_timeout));

        check_programs();
        millisleep(sleep_time * 1000);
    }
    // Not reached: the loop above never exits.
    write_to_log("Leaving check()\n\n");
    return arg;
}

void init_daemon(void)
{
    pid_t pid;
    struct rlimit rl;
    struct sigaction sa;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
        exit(1);
    }

    if ((pid = fork()) < 0) {
        exit(1);    // fork failed
    } else if (pid > 0) {
        exit(0);    // parent exits; the first child carries on in the background
    }

    // First child: become session leader and process group leader,
    // detaching from the controlling terminal.
    setsid();

    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGHUP, &sa, NULL) < 0) {
        exit(1);
    }

    if ((pid = fork()) < 0) {
        exit(1);    // fork failed
    } else if (pid > 0) {
        exit(0);    // first child exits
    }

    // Second child: it is no longer a session leader.
    umask(0);
    if (rl.rlim_max == RLIM_INFINITY) {
        rl.rlim_max = 1024;
    }

    // Close every file descriptor that may still be open.
    for (rlim_t fd = 0; fd < rl.rlim_max; ++fd) {
        close((int)fd);
    }

    // chdir("/tmp");   // optionally change the working directory to /tmp
    return;
}

int main()
{
    init_daemon();   // turn this process into a daemon
    write_to_log("Entering main()\n");
    signal(SIGHUP, SIG_IGN);
    signal(SIGTSTP, SIG_IGN);
    signal(SIGTTOU, SIG_IGN);
    signal(SIGTTIN, SIG_IGN);
    init();
    check(NULL);
    write_to_log("leaving main()\n\n");
    return 0;
}
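
The check_programs() listing above relies on kill(pid, 0) to decide whether the pid recorded in a lockfile still belongs to a live process. The following is a minimal sketch of that probe in isolation. The names process_alive, start_sleeping_child and kill_and_reap are hypothetical helpers made up for illustration; they do not appear in traffic_cop.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: returns true if a process with this pid exists.
// Signal 0 performs the existence and permission checks without
// delivering anything to the target.
static bool process_alive(pid_t pid)
{
    assert(pid > 0);              // pid <= 0 would address a process group
    if (kill(pid, 0) == 0) {
        return true;              // exists and we are allowed to signal it
    }
    return errno == EPERM;        // exists, but we lack permission to signal it
}

// Hypothetical demo helper: start a child that blocks until it is killed.
static pid_t start_sleeping_child()
{
    pid_t pid = fork();
    if (pid == 0) {
        for (;;) {
            pause();              // sleep until a signal arrives
        }
    }
    return pid;                   // -1 if fork() failed
}

// Hypothetical demo helper: kill the child and reap it, so that its
// pid no longer refers to a zombie entry.
static void kill_and_reap(pid_t pid)
{
    kill(pid, SIGKILL);
    int status = 0;
    waitpid(pid, &status, 0);
}
```

Two details are worth keeping in mind. The listing treats any -1 from kill() as "not running", whereas this sketch counts EPERM as "running but owned by someone else". Also, a terminated child that has not been reaped with waitpid() is still a zombie, and kill(pid, 0) keeps succeeding for it, which is why the sketch reaps before probing again.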


5.traffic_manager.cpp

#include "lock_and_kill.h"
#include "log.h"
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <string.h>
#include <stdio.h>

#define NOWARN_UNUSED(x) (void)(x)
static char manager_lockfile[4096] = "manager_lockfile";
static char server_lockfile[4096] = "server_lockfile";
static int server_failures = 0;
static int killsig = SIGKILL;
static int coresig = 0;
static char server_binary[4096] = "server_binary";
static const int sleep_time = 10;               // 10 sec
static const int manager_timeout = 3 * 60;      //  3 min
static const int server_timeout = 3 * 60;       //  3 min
static const int kill_timeout = 1 * 60;         //  1 min

// SIGALRM handler used while killing: re-arm the alarm and carry on.
static void sig_alarm_warn(int signum = 0)
{
    NOWARN_UNUSED(signum);
    alarm(kill_timeout);
}

// SIGALRM handler used otherwise: a timeout is fatal.
static void sig_fatal(int signum)
{
    NOWARN_UNUSED(signum);
    abort();
}

static void set_alarm_warn()
{
    struct sigaction action;
    action.sa_handler = sig_alarm_warn;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void set_alarm_death()
{
    struct sigaction action;
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void sig_child(int signum)
{
    NOWARN_UNUSED(signum);
    pid_t pid = 0;
    int status = 0;
    for (;;) {
        pid = waitpid(WAIT_ANY, &status, WNOHANG);

        if (pid <= 0) {
            break;
        }
        // TSqa03086 - We can not log the child status signal from
        //   the signal handler since syslog can deadlock.  Record
        //   the pid and the status in a global for logging
        //   next time through the event loop.  We will occasionally
        //   lose some information if we get two sig childs in rapid
        //   succession
        // child_pid = pid;
        // child_status = status;
    }
}

static void safe_kill(const char* lockfile_name, const char* pname, bool group)
{
    Lockfile lockfile(lockfile_name);
    write_to_log("Entering safe_kill\n");
    set_alarm_warn();
    alarm(kill_timeout);

    if (group == true) {
        lockfile.KillGroup(killsig, coresig, pname);
    } else {
        lockfile.Kill(killsig, coresig, pname);
    }
    alarm(0);
    set_alarm_death();
    write_to_log("Leaving safe_kill\n\n");
}

static void spawn_server()
{
    write_to_log("--------------Entering spawn_server()!\n\n");

    pid_t pid = fork();
    if (pid == 0) {
        // Child: replace this process image with the server binary.
        // argv must be a NULL-terminated array; argv[0] is the program name.
        char* const argv[] = { server_binary, NULL };
        execv(server_binary, argv);

        // execv only returns on failure.
        write_to_log("--------------somehow execv failed!\n");
        _exit(1);
    } else if (pid == -1) {
        write_to_log("--------------unable to fork server !\n");
        exit(1);
    }

    server_failures = 0;
    write_to_log("--------------Leaving spawn_server()!\n\n");
}

void check_server()
{
    int err;
    pid_t holding_pid;
    Lockfile server_lf(server_lockfile);
    // Probe with Open(), as traffic_cop does: Get() would record the
    // manager's own pid in the server's lockfile.
    err = server_lf.Open(&holding_pid);

    if (err == 0) {
        // The lockfile names a holder; check that the pid still exists.
        if (kill(holding_pid, 0) == -1) {
            ink_killall(server_binary, killsig);
            sleep(1);   // give signals a chance to be received
            err = server_lf.Open(&holding_pid);
        }
    }

    if (err > 0) {
        // Nobody holds the lockfile: clean up any leftovers and start a new server.
        server_lf.Close();
        safe_kill(server_lockfile, server_binary, false);
        spawn_server();
    }
}

int main()
{
    pid_t holding_pid = 0;
    Lockfile manager_lf(manager_lockfile);
    manager_lf.Get(&holding_pid);   // take the manager lockfile and record our pid in it

    while (1) {
        char buf[100];
        snprintf(buf, sizeof(buf), "----------------traffic_manager is running, pid:'%d'!\n", getpid());
        write_to_log(buf);

        printf("----------------traffic_manager is running, pid: %d\n", getpid());

        sleep(5);
        int c = rand() % 10;

        if (c == 1) {
            // Simulate a failure of the manager process.
            write_to_log("----------------traffic_manager has an exception and exits!\n");
            exit(1);
        } else {
            // Check on the server process.
            check_server();
        }
    }
}
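
The manager's restart path above comes down to fork() followed by execv(), with the parent later collecting the child's status through waitpid(). Here is a small sketch of that pattern on its own, assuming a POSIX system. The names spawn and wait_exit_code are hypothetical and exist only for this illustration; the exit code 127 for a failed exec is likewise just a convention chosen here.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: fork and exec `path` with a NULL-terminated argv.
// Returns the child's pid, or -1 if fork() failed.
static pid_t spawn(const char* path, char* const argv[])
{
    assert(path != nullptr && argv != nullptr && argv[0] != nullptr);
    pid_t pid = fork();
    if (pid == 0) {
        execv(path, argv);        // does not return on success
        _exit(127);               // reached only if execv failed
    }
    return pid;
}

// Hypothetical helper: block until `pid` terminates. Returns its exit
// code, or -1 if it was killed by a signal or waitpid() failed.
static int wait_exit_code(pid_t pid)
{
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child calls _exit() rather than exit() after a failed exec so that it does not run the parent's atexit handlers or flush stdio buffers it inherited. A long-running manager would normally reap with WNOHANG from its event loop or a SIGCHLD handler, as sig_child() does, instead of blocking.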


6.traffic_server.cpp

#include "log.h"
#include "lock_and_kill.h"
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

static char server_lockfile[4096] = "server_lockfile";

int main()
{
    pid_t holding_pid = 0;
    Lockfile server_lf(server_lockfile);
    server_lf.Get(&holding_pid);    // take the server lockfile and record our pid in it

    while (1) {
        char buf[100];
        snprintf(buf, sizeof(buf), "==============traffic_server is running, pid:'%d'!\n", getpid());
        write_to_log(buf);
        sleep(5);
        int c = rand() % 100;

        if (c < 30) {
            // Simulate a failure of the server process.
            write_to_log("=================traffic_server has an exception and exits!\n");
            exit(1);
        }
    }
    return 0;
}
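
Both traffic_manager and traffic_server start by taking their lockfile, and the monitors then ask the lockfile who holds it. The sketch below shows one way such a lockfile can work, assuming POSIX fcntl() record locks. It is not the Lockfile class from lock_and_kill.h; lockfile_acquire, lockfile_probe and hold_lock_in_child are hypothetical names, and only the return convention of the probe (1 = free, 0 = held, negative = error) is borrowed from the description of Lockfile::Open().

```cpp
#include <cassert>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Describe an exclusive lock covering the whole file.
static struct flock whole_file_write_lock()
{
    struct flock lock = {};
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;               // 0 means "up to end of file"
    return lock;
}

// Hypothetical helper: lock `path` for the calling process. Returns the
// descriptor (keep it open to keep the lock), or -1 on failure.
static int lockfile_acquire(const char* path)
{
    assert(path != nullptr);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    struct flock lock = whole_file_write_lock();
    if (fcntl(fd, F_SETLK, &lock) == -1) {
        close(fd);                // somebody else already holds the lock
        return -1;
    }
    return fd;
}

// Hypothetical helper: 1 if nobody holds the lock, 0 if another process
// holds it (its pid is stored in *holding_pid), -1 on error.
// Call it only from a process that does not hold the lock itself.
static int lockfile_probe(const char* path, pid_t* holding_pid)
{
    assert(path != nullptr && holding_pid != nullptr);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    struct flock lock = whole_file_write_lock();
    int err = fcntl(fd, F_GETLK, &lock);
    close(fd);
    if (err == -1) {
        return -1;
    }
    if (lock.l_type == F_UNLCK) {
        return 1;                 // no conflicting lock was found
    }
    *holding_pid = lock.l_pid;
    return 0;
}

// Hypothetical demo helper: fork a child that takes the lock and then
// sleeps. Returns the child's pid once the lock is held, or -1.
static pid_t hold_lock_in_child(const char* path)
{
    int ready[2];
    if (pipe(ready) == -1) {
        return -1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        close(ready[0]);
        char ok = (lockfile_acquire(path) >= 0) ? 'y' : 'n';
        if (write(ready[1], &ok, 1) != 1) {
            _exit(1);
        }
        for (;;) {
            pause();              // hold the lock until killed
        }
    }
    close(ready[1]);
    char ok = 'n';
    ssize_t n = (pid > 0) ? read(ready[0], &ok, 1) : -1;
    close(ready[0]);
    if (pid > 0 && (n != 1 || ok != 'y')) {
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
        return -1;
    }
    return pid;
}
```

The kernel drops an fcntl() lock when its owner exits, so a crashed server frees its lockfile without any cleanup code, which is the property the monitors depend on. The same rule has a sharp edge: closing any descriptor for the file releases all of that process's locks on it, so the lock holder must not call the probe on its own lockfile.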

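I wrote this document during an earlier study of Traffic Server. I hope it helps anyone who is interested, and corrections are welcome. It is only a brief analysis of how the Traffic Server processes control one another, and much of the test code is simplified (the heartbeat checks, for example), as noted in the code comments.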
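
Since the heartbeat checks in the test code are simplified, here is a hedged sketch of the one part that any real heartbeat needs: waiting for a reply with a timeout, so that a hung process counts as a failure. It assumes the reply arrives on some readable file descriptor (a socket or a pipe) and uses poll(). The name wait_for_reply is hypothetical, and this is not how heartbeat_manager() or heartbeat_server() are implemented.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: wait up to timeout_ms for a reply on fd.
// Returns 1 if at least one byte arrived, 0 if the timeout expired,
// -1 on error or if the peer closed the connection without replying.
static int wait_for_reply(int fd, int timeout_ms)
{
    assert(fd >= 0);
    struct pollfd pfd = {};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) {
        return 0;                 // nothing arrived within the timeout
    }
    if (ready < 0) {
        return -1;                // poll itself failed
    }
    if (!(pfd.revents & POLLIN)) {
        return -1;                // hang-up or error without any data
    }

    char byte;
    ssize_t n = read(fd, &byte, 1);
    return n == 1 ? 1 : -1;       // n == 0 means the peer closed
}
```

A monitor built on this would treat both 0 and -1 as a failed heartbeat and count consecutive failures before killing anything, in the same spirit as the server_not_found counter in check_programs().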

 

 

posted @ 2013-02-26 11:49  刘少东的博客